library(faraway)

Inference

With our model estimated, we can now move on to inference. It is important to note that estimating the intercept \(\beta_0\) and the parameters \(\beta_i\), \(i=1,\dots, p\), requires no particular assumptions: OLS is simply an algebraic tool. However, if we wish to compute confidence intervals (CIs) or perform hypothesis tests, we must assume that the error terms are normally distributed. Keeping this assumption in mind is essential when drawing conclusions from the model.

Testing multiple parameters

Given several predictors, you might want to test whether all of them are needed. Consider a larger model, \(\Omega\), of dimension \(p\), and a smaller model, \(\omega\), of dimension \(q\), which uses a subset of the predictors in \(\Omega\). We can then form the following \(F\) statistic:

\[ F=\frac{(RSS_{\omega}-RSS_{\Omega})/(p-q)}{RSS_{\Omega}/(n-p)}\sim F_{p-q,n-p} \]

The details of the derivation of this statistic are outside the scope of this course. However, \(F\) arises from a likelihood ratio test, and therefore relies on the assumption that the error terms are normally distributed. Hence, this test, and most of the tests we will see in this section, cannot be used if there is a strong violation of the normality assumption.

R makes these computations straightforward. To illustrate, we will use the Galapagos Islands dataset to fit the same model as before, with the number of species as the response variable and the geographic variables as predictors.

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

We then fit the model with only the intercept. The function we are going to use is anova(), which will do the job for us:

nullmodel <- lm(Species ~ 1, data=gala)
anova( nullmodel, lmod)
## Analysis of Variance Table
## 
## Model 1: Species ~ 1
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 381081                                  
## 2     24  89231  5    291850 15.699 6.838e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can read off the result of the test directly. The null hypothesis here is

\[ H_0:\beta_1=\beta_2=\beta_3=\beta_4=\beta_5=0\\ H_1: \text{at least one is different from zero} \]

The p-value is \(6.838\times 10^{-7}\), so we reject the null. We can verify this using the formula derived above:

rss0 <- deviance(nullmodel)   # RSS of the intercept-only model
rss <- deviance(lmod)         # RSS of the full model
df0 <- df.residual(nullmodel) # residual df of the intercept-only model
df <- df.residual(lmod)       # residual df of the full model

(fstatistic <- ((rss0-rss)/(df0-df))/(rss/df))
## [1] 15.69941

We can see that this result matches the one obtained using anova().

1-pf(fstatistic, df0-df, df)
## [1] 6.837893e-07

The p-value matches as well.
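
The same overall \(F\) statistic, with its degrees of freedom, is also reported in the last line of summary(lmod), and can be extracted directly as a further check:

# value, numerator df and denominator df of the overall F test
summary(lmod)$fstatistic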

Testing one predictor

Suppose we want to test

\[ H_0:\beta_i=0 \quad \text{vs} \quad H_1:\beta_i\ne 0 \] Let \(\Omega\) be the model with all the predictors of interest, which has \(p\) parameters, and let \(\omega\) be the model with the same predictors except predictor \(i\) (the one we are testing). Let's test whether Area can be dropped from the full model by testing the hypothesis that the corresponding parameter is zero.

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
lmods <- lm(Species ~ Elevation + Nearest + Scruz + Adjacent, data = gala)
anova(lmods, lmod)
## Analysis of Variance Table
## 
## Model 1: Species ~ Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     25 93469                           
## 2     24 89231  1    4237.7 1.1398 0.2963

The p-value of about 0.3 tells us that we cannot reject the null. An alternative (equivalent) approach is to use the \(t\)-statistic for testing the hypothesis:

\[ t_i=\hat{\beta}_i/se(\hat{\beta}_i) \]

This test statistic can be shown to follow a Student's \(t\) distribution with \(n-p\) degrees of freedom.
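
As a quick check of this equivalence on the test for Area above, the \(t\) statistic from the coefficient table of the full model squares to the \(F\) statistic reported by anova():

# t statistic for Area from the full-model coefficient table
(tstat <- coef(summary(lmod))["Area", "t value"])
# its square matches the F statistic from the anova() comparison above
tstat^2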

Testing a pair of predictors

In statistical terms, if we want to investigate whether the area of either the current island or the adjacent island has any relationship with the response, we can phrase the problem as follows:

\[ H_0:\beta_{Area}=\beta_{Adjacent}=0 \]

This test too can be carried out with anova():

lmods <- lm(Species ~ Elevation + Nearest + Scruz, data = gala)
anova(lmod, lmods)
## Analysis of Variance Table
## 
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Elevation + Nearest + Scruz
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)   
## 1     24  89231                               
## 2     26 158292 -2    -69060 9.2874 0.00103 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null is rejected, since the p-value (about 0.001) is small.
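
Note that this joint test is not the same as looking at the two individual \(t\)-tests in the full-model summary, each of which tests one parameter with the other predictors retained:

# individual t-tests for Area and Adjacent from the full model
coef(summary(lmod))[c("Area", "Adjacent"), ]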

Testing a subspace

Suppose now that we want to test

\[ H_0: \beta_{Area}=\beta_{Adjacent} \] We can test this hypothesis with anova() as well:

lmods <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest + Scruz, data = gala)
anova(lmods, lmod)
## Analysis of Variance Table
## 
## Model 1: Species ~ I(Area + Adjacent) + Elevation + Nearest + Scruz
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
## 1     25 109591                             
## 2     24  89231  1     20360 5.476 0.02793 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The function I() ensures that its argument is evaluated arithmetically rather than interpreted as part of the model formula. The p-value suggests that the null can be rejected.
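
As a small illustration, writing the sum without I() would simply reproduce the full model, since + inside a formula adds separate terms rather than summing the variables:

# without I(), "+" is formula syntax: Area and Adjacent enter as two
# separate predictors, so this is just the full model again
lm(Species ~ Area + Adjacent + Elevation + Nearest + Scruz, data = gala)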

Now suppose we want to test

\[ H_0:\beta_{Elevation}=0.5 \]

This can be set up in R using an offset, i.e. a term that enters the model with a fixed, known coefficient rather than an estimated one:

lmods <- lm(Species ~ Area + offset(0.5 * Elevation) + Nearest + Scruz + Adjacent, data = gala)
anova(lmod, lmods)
## Analysis of Variance Table
## 
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + offset(0.5 * Elevation) + Nearest + Scruz + 
##     Adjacent
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     24  89231                                
## 2     25 131312 -1    -42081 11.318 0.002574 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value is small, so the null is rejected.
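
Equivalently, we can build a \(t\) statistic for this hypothesis from the estimate and standard error of Elevation in the full model; its square reproduces the \(F\) statistic above:

# t statistic for H0: beta_Elevation = 0.5
(tstat <- (coef(lmod)["Elevation"] - 0.5) /
    coef(summary(lmod))["Elevation", "Std. Error"])
# squaring recovers the F statistic from the anova() comparison
tstat^2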

Confidence Intervals

Confidence intervals (CIs) allow us to quantify the uncertainty in our estimates of \(\beta\): they provide a range of plausible values for the true parameter, which helps us draw accurate conclusions from our data. A \(100(1-\alpha)\%\) CI for \(\beta_i\) takes the form

\[ \hat{\beta}_i\pm t_{n-p,\,\alpha/2}\, se(\hat{\beta}_i) \]

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
summary(lmod)
## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
##     data = gala)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.679  -34.898   -7.862   33.460  182.584 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.068221  19.154198   0.369 0.715351    
## Area        -0.023938   0.022422  -1.068 0.296318    
## Elevation    0.319465   0.053663   5.953 3.82e-06 ***
## Nearest      0.009144   1.054136   0.009 0.993151    
## Scruz       -0.240524   0.215402  -1.117 0.275208    
## Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared:  0.7658, Adjusted R-squared:  0.7171 
## F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

We can construct a \(95\%\) CI for \(\beta_{Area}\) manually, for which we need the \(2.5\%\) and \(97.5\%\) percentiles of the \(t\) distribution with \(30-6=24\) degrees of freedom.

qt(0.975, 24)
## [1] 2.063899
-0.02394 +c(-1,1)*qt(0.975, 24)*0.02242
## [1] -0.07021261  0.02233261

The CI contains zero, which indicates that the null hypothesis \(H_0:\beta_{Area}=0\) would not be rejected at the \(5\%\) level.

A convenient way to obtain all the univariate intervals at once is confint():

confint(lmod)
##                   2.5 %      97.5 %
## (Intercept) -32.4641006 46.60054205
## Area         -0.0702158  0.02233912
## Elevation     0.2087102  0.43021935
## Nearest      -2.1664857  2.18477363
## Scruz        -0.6850926  0.20404416
## Adjacent     -0.1113362 -0.03827344
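
If a different confidence level is needed, confint() accepts a level argument, e.g. for \(90\%\) intervals:

confint(lmod, level = 0.90)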

Exercises

  1. Using the prostate data, fit a model with lpsa as the response and the other variables as predictors.
     a. Compute \(90\%\) and \(95\%\) CIs for the parameter associated with age.
     b. Using just these intervals, what could we have deduced about the p-value for age in the regression summary?
  2. Using the sat data:
     a. Fit a model with the total SAT score as the response and expend, ratio and salary as predictors. Test the hypothesis that \(\beta_{salary}=0\). Test the hypothesis that \(\beta_{salary}=\beta_{ratio}=\beta_{expend}=0\). Do any of these predictors have an effect on the response?
     b. Now add takers to the model. Test the hypothesis that \(\beta_{takers}=0\). Compare this model to the previous one using an \(F\)-test. Demonstrate that the \(F\)-test and the \(t\)-test here are equivalent.
  3. Find a formula relating \(R^2\) and the \(F\)-test for the regression (please try to attempt this alone, without external help or AI!).
