A linear model is of the form:
\(y = X\beta + \epsilon\)
where \(y\) is a vector of \(n\) observations of the response variable, \(X\) is the model matrix of linear predictors (numeric or categorical) and \(\epsilon\) is a vector of errors, usually assumed to be independent and normally distributed. The issue is to estimate the vector of \(p\) parameters in \(\beta\). This can be done by different methods, such as Least Squares and Maximum Likelihood.
Consider the following data set:
file_address <- 'https://raw.githubusercontent.com/arsilva87/statsbook/main/datasets/camadassolo.csv'
soil <- read.csv(file = file_address)
str(soil) # data set structure'data.frame': 84 obs. of 7 variables:
$ US : num 6.85 10.61 6.63 6.63 10.72 ...
$ DS : num 1.77 1.58 1.79 1.78 1.57 1.56 1.61 1.68 1.65 1.68 ...
$ RP : num 4.37 1.64 5.15 4.83 0.4 0.74 1.64 2.8 2.14 3.99 ...
$ CO : num 9.94 26.89 9.94 9.94 27.8 ...
$ Argila: int 16 13 18 18 12 12 13 14 14 14 ...
$ Tensao: num 83 64.7 93.5 87.5 61 61.9 67.9 69.4 68 72.7 ...
$ Camada: chr "0-20" "0-20" "0-20" "0-20" ...
It consists of soil physical variables (US: soil moisture, DS: bulk density, RP: penetration resistance, CO: organic carbon content, Argila: clay content, Tensao: preconsolidation pressure) taken from 42 sampling points in two layers (0-20, 20-40 cm). Suppose we are interested in fitting a linear model to analyze the effect of the factor ‘Camada’ on ‘RP’. We use the built-in function lm(). But, before that, it is highly advisable to visually examine the data and, for instance, check for outliers.
boxplot(formula = RP ~ Camada, data = soil)Now, using the same syntaxe, let’s fit the model:
model_1 <- lm(formula = RP ~ Camada, data = soil)
model_1
Call:
lm(formula = RP ~ Camada, data = soil)
Coefficients:
(Intercept) Camada20-40
3.1774 -0.5395
The two coefficients are reflecting the model matrix (run model.matrix(model_1) to get \(X\)), which indicates the general estimated mean of \(y\) and the effect of level ‘20-40’ of the factor ‘Camada’. By default, R considers null the first (alphabetic order) level of a factor. Thus, \(\bar{y}_{0-20} = 3.1774\) and \(\bar{y}_{20-40} = 3.1774 - 0.5395 = 2.6379\).
The main purpose with the model fitted earlier is to evaluate the null hypothesis:
\[H_0: \mu_{0-20} = \mu_{20-40}\] that is, the factor ‘Camada’ in \(X\) does not affect the response variable \(y\), in average.
The simplest test for that is \(t\)-Student.
summary(model_1)
Call:
lm(formula = RP ~ Camada, data = soil)
Residuals:
Min 1Q Median 3Q Max
-2.7774 -0.9900 -0.1029 0.6146 3.3626
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1774 0.1952 16.279 <2e-16 ***
Camada20-40 -0.5395 0.2760 -1.955 0.054 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.265 on 82 degrees of freedom
Multiple R-squared: 0.04452, Adjusted R-squared: 0.03287
F-statistic: 3.821 on 1 and 82 DF, p-value: 0.05403
Also, we can compute the Analysis of Variance (ANOVA) and apply the equivalent \(F\)-test.
anova(model_1)Analysis of Variance Table
Response: RP
Df Sum Sq Mean Sq F value Pr(>F)
Camada 1 6.113 6.1128 3.8206 0.05403 .
Residuals 82 131.196 1.6000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With two levels, multiple comparisons’ procedures such as Tukey’s test are not actually needed. We’ll do it anyway to illustrate, using the package emmeans.
library(emmeans)
fit_means <- emmeans(model_1, specs = 'Camada')
fit_means Camada emmean SE df lower.CL upper.CL
0-20 3.18 0.195 82 2.79 3.57
20-40 2.64 0.195 82 2.25 3.03
Confidence level used: 0.95
pairs(fit_means) contrast estimate SE df t.ratio p.value
(0-20) - (20-40) 0.54 0.276 82 1.955 0.0540
Try plot(fit_means).
Those inferential methods are parametric, assuming normality and homogeneity of variances for the residuals, \(\hat{\epsilon}\).
We evaluate \(H_0\): data is normally distributed, executing
res <- residuals(model_1)
shapiro.test(res)
Shapiro-Wilk normality test
data: res
W = 0.98105, p-value = 0.2529
bartlett.test(res ~ Camada, data = soil)
Bartlett test of homogeneity of variances
data: res by Camada
Bartlett's K-squared = 4.6131, df = 1, p-value = 0.03173
It is also advisable to check the dispersion of the Student residuals against the fitted values \(\hat{y}\) to track trends on data, heteroscedasticity.
st_res <- rstudent(model_1)
y_fit <- fitted(model_1)Also, residual outliers can be identified using a 95% confidence interval calculated with the \(t\)-Student distribution:
n <- nrow(soil)
li.res <- qt(0.025/(2*n), df = n - 2)
ls.res <- qt(1 - 0.025/(2*n), df = n - 2)
plot(x = y_fit, y = st_res, ylim = c(-4, 4))
abline(h = c(li.res, ls.res), col = 'red')Suppose that the predictor of interest in \(X\) is not categorical, but numeric. In the example, say not ‘Camada’ but ‘US’, soil moisture. The response variable is still the same, ‘RP’.
plot(RP ~ US, soil,
ylab = 'Penetration resistance (kPa)',
xlab = 'Soil moisture (%)')We could try to describe that relationship using a line, \(y = \beta_0 + \beta_1x + \epsilon\), where \(\beta_0\) is the intercept and \(\beta_1\) is the angular coefficient.
model_2 <- lm(RP ~ US, soil)
model_2
Call:
lm(formula = RP ~ US, data = soil)
Coefficients:
(Intercept) US
7.860 -0.531
b <- coef(model_2)
curve(b[1] + b[2]*x, from = 6, to = 14)Now let’s use the \(t\)-test to evaluate
\[H_0: \beta_1 = 0\]
summary(model_2)
Call:
lm(formula = RP ~ US, data = soil)
Residuals:
Min 1Q Median 3Q Max
-1.76681 -0.56160 -0.08894 0.43859 2.74848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.8597 0.4199 18.72 <2e-16 ***
US -0.5311 0.0441 -12.04 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7777 on 82 degrees of freedom
Multiple R-squared: 0.6388, Adjusted R-squared: 0.6344
F-statistic: 145 on 1 and 82 DF, p-value: < 2.2e-16
with which we can observe also the coefficient of determination, R-squared.
new_us <- seq(6, 14, length.out = 50)
pred <- predict(model_2,
newdata = data.frame(US = new_us),
interval = 'confidence', level = 0.95)
matplot(pred, type = 'l')soil, fit a linear regression model for each level of ‘Camada’. Fit the lines on the dispersion plot.