LM Framework

A linear model is of the form:

\(y = X\beta + \epsilon\)

where \(y\) is a vector of \(n\) observations of the response variable, \(X\) is the model matrix built from the predictors (numeric or categorical), and \(\epsilon\) is a vector of errors, usually assumed to be independent and normally distributed. The task is to estimate the vector of \(p\) parameters \(\beta\), which can be done by different methods, such as least squares and maximum likelihood.
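The least squares solution, for instance, has a well-known closed form (assuming \(X\) has full column rank):

\[\hat{\beta} = (X^\top X)^{-1} X^\top y\]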

Consider the following data set:

file_address <- 'https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv'
soil <- read.csv(file = file_address)
str(soil)   # data set structure
'data.frame':   84 obs. of  7 variables:
 $ US    : num  6.85 10.61 6.63 6.63 10.72 ...
 $ DS    : num  1.77 1.58 1.79 1.78 1.57 1.56 1.61 1.68 1.65 1.68 ...
 $ RP    : num  4.37 1.64 5.15 4.83 0.4 0.74 1.64 2.8 2.14 3.99 ...
 $ CO    : num  9.94 26.89 9.94 9.94 27.8 ...
 $ Argila: int  16 13 18 18 12 12 13 14 14 14 ...
 $ Tensao: num  83 64.7 93.5 87.5 61 61.9 67.9 69.4 68 72.7 ...
 $ Camada: chr  "0-20" "0-20" "0-20" "0-20" ...

It consists of soil physical variables (US: soil moisture, DS: bulk density, RP: penetration resistance, CO: organic carbon content, Argila: clay content, Tensao: preconsolidation pressure) taken from 42 sampling points in two layers (0-20 and 20-40 cm). Suppose we are interested in fitting a linear model to analyze the effect of the factor 'Camada' (layer) on 'RP'. We will use the built-in function lm(), but before that it is highly advisable to examine the data visually and, for instance, check for outliers.

boxplot(formula = RP ~ Camada, data = soil)

Now, using the same formula syntax, let's fit the model:

model_1 <- lm(formula = RP ~ Camada, data = soil)
model_1

Call:
lm(formula = RP ~ Camada, data = soil)

Coefficients:
(Intercept)  Camada20-40  
     3.1774      -0.5395  

The two coefficients reflect the model matrix (run model.matrix(model_1) to get \(X\)): the intercept is the estimated mean of the reference level '0-20', and the second coefficient is the effect of the level '20-40' of the factor 'Camada'. By default, R sets the effect of the first level (in alphabetical order) of a factor to zero, making it the reference. Thus, \(\bar{y}_{0-20} = 3.1774\) and \(\bar{y}_{20-40} = 3.1774 - 0.5395 = 2.6379\).
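As a quick check, the level means implied by the coefficients can be compared with the raw group means (a small sketch using base R only):

cf <- coef(model_1)
c('0-20' = cf[[1]], '20-40' = cf[[1]] + cf[[2]])   # means implied by the model
tapply(soil$RP, soil$Camada, mean)                 # raw group means (should match)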

Statistical tests

The main purpose of the model fitted earlier is to evaluate the null hypothesis:

\[H_0: \mu_{0-20} = \mu_{20-40}\] that is, that the factor 'Camada' in \(X\) does not affect the response variable \(y\), on average.

Student's \(t\)-test

The simplest test for that is Student's \(t\)-test, reported by summary().

summary(model_1)

Call:
lm(formula = RP ~ Camada, data = soil)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7774 -0.9900 -0.1029  0.6146  3.3626 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.1774     0.1952  16.279   <2e-16 ***
Camada20-40  -0.5395     0.2760  -1.955    0.054 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.265 on 82 degrees of freedom
Multiple R-squared:  0.04452,   Adjusted R-squared:  0.03287 
F-statistic: 3.821 on 1 and 82 DF,  p-value: 0.05403
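Since 'Camada' has only two levels, the same result can be reproduced with a pooled two-sample \(t\)-test, which, like the linear model, assumes equal variances:

t.test(RP ~ Camada, data = soil, var.equal = TRUE)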

ANOVA

Also, we can compute the Analysis of Variance (ANOVA) and apply the equivalent \(F\)-test.

anova(model_1)
Analysis of Variance Table

Response: RP
          Df  Sum Sq Mean Sq F value  Pr(>F)  
Camada     1   6.113  6.1128  3.8206 0.05403 .
Residuals 82 131.196  1.6000                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
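With a single two-level factor, the \(F\)-test is equivalent to the \(t\)-test above; the \(F\) statistic is simply the square of the \(t\) statistic:

tval <- coef(summary(model_1))['Camada20-40', 't value']
tval^2   # 3.8206, the F value in the ANOVA table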

Multiple comparisons of means

With only two levels, multiple comparison procedures such as Tukey's test are not actually needed. We will run one anyway, to illustrate, using the package emmeans.

library(emmeans)
fit_means <- emmeans(model_1, specs = 'Camada')
fit_means
 Camada emmean    SE df lower.CL upper.CL
 0-20     3.18 0.195 82     2.79     3.57
 20-40    2.64 0.195 82     2.25     3.03

Confidence level used: 0.95 
pairs(fit_means)
 contrast         estimate    SE df t.ratio p.value
 (0-20) - (20-40)     0.54 0.276 82   1.955  0.0540

Try plot(fit_means).

Checking the model's assumptions

Those inferential methods are parametric, assuming normality and homogeneity of variances for the residuals, \(\hat{\epsilon}\).

Shapiro-Wilk’s test

We evaluate \(H_0\): the residuals are normally distributed, by executing

res <- residuals(model_1)
shapiro.test(res)

    Shapiro-Wilk normality test

data:  res
W = 0.98105, p-value = 0.2529

Since the p-value is greater than 0.05, normality of the residuals is not rejected.

Bartlett's test

We evaluate \(H_0\): the residual variances are homogeneous across the levels of 'Camada', executing

bartlett.test(res ~ Camada, data = soil)

    Bartlett test of homogeneity of variances

data:  res by Camada
Bartlett's K-squared = 4.6131, df = 1, p-value = 0.03173

Here the p-value is below 0.05, so homogeneity of variances is rejected; this should be kept in mind when interpreting the parametric tests above.

Residual analysis

It is also advisable to plot the studentized residuals against the fitted values \(\hat{y}\), looking for trends in the data and for heteroscedasticity.

st_res <- rstudent(model_1)
y_fit <- fitted(model_1)

Also, residual outliers can be flagged using limits from the \(t\) distribution, adjusted for the number of residuals inspected (a Bonferroni-type correction):

n <- nrow(soil)
li.res <- qt(0.025/(2*n), df = n - 2)
ls.res <- qt(1 - 0.025/(2*n), df = n - 2)
plot(x = y_fit, y = st_res, ylim = c(-4, 4))
abline(h = c(li.res, ls.res), col = 'red')
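To list which observations, if any, fall outside those limits, a small convenience sketch:

out_idx <- which(st_res < li.res | st_res > ls.res)
out_idx   # row numbers of the flagged residuals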

Linear regression

Suppose that the predictor of interest in \(X\) is not categorical, but numeric; in the example, 'US' (soil moisture) instead of 'Camada'. The response variable is still the same, 'RP'.

plot(RP ~ US, soil, 
     ylab = 'Penetration resistance (kPa)',
     xlab = 'Soil moisture (%)')

We could try to describe that relationship using a straight line, \(y = \beta_0 + \beta_1 x + \epsilon\), where \(\beta_0\) is the intercept and \(\beta_1\) is the slope.
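Under least squares, these estimates have the familiar closed forms:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]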

model_2 <- lm(RP ~ US, soil)
model_2

Call:
lm(formula = RP ~ US, data = soil)

Coefficients:
(Intercept)           US  
      7.860       -0.531  
b <- coef(model_2)   # extract the estimated coefficients
curve(b[1] + b[2]*x, from = 6, to = 14)   # draw the fitted line

Now let’s use the \(t\)-test to evaluate

\[H_0: \beta_1 = 0\]

summary(model_2)

Call:
lm(formula = RP ~ US, data = soil)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76681 -0.56160 -0.08894  0.43859  2.74848 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.8597     0.4199   18.72   <2e-16 ***
US           -0.5311     0.0441  -12.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7777 on 82 degrees of freedom
Multiple R-squared:  0.6388,    Adjusted R-squared:  0.6344 
F-statistic:   145 on 1 and 82 DF,  p-value: < 2.2e-16

which also reports the coefficient of determination, R-squared.
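In a simple linear regression, R-squared equals the squared Pearson correlation between the two variables, which is easy to verify:

cor(soil$US, soil$RP)^2   # 0.6388, matching the Multiple R-squared above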

Make predictions with confidence bands

new_us <- seq(6, 14, length.out = 50)
pred <- predict(model_2, 
                newdata = data.frame(US = new_us),
                interval = 'confidence', level = 0.95)
matplot(x = new_us, y = pred, type = 'l',
        lty = c(1, 2, 2), col = c(1, 2, 2))
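The interval = 'confidence' bands cover the mean response. For bands that cover individual new observations, predict() also accepts interval = 'prediction', which gives wider limits:

pred_obs <- predict(model_2,
                    newdata = data.frame(US = new_us),
                    interval = 'prediction', level = 0.95)
matlines(x = new_us, y = pred_obs, lty = 3, col = 'blue')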

License: CC BY 4.0