This document gives an overview of how the F-statistic in a (simple and multiple) linear model is calculated.
Consider the following data set:
# Load the tidyverse (provides read_csv, select, drop_na)
library(tidyverse)
# Load data
blomkvist <- read_csv("blomkvist.csv")
# Select the variables needed
blomkvist <- select(blomkvist, id, age, rt = rt_hand_d)
# Remove rows with missing data
blomkvist <- drop_na(blomkvist)
Say we assume that reaction time rt depends on
age. Using a linear regression lm we can model such a
relationship.
m_rt <- lm(rt ~ age, data = blomkvist)
The inferential summary of this model can be obtained using the
summary function applied to the model object
m_rt.
summary(m_rt)
Call:
lm(formula = rt ~ age, data = blomkvist)
Residuals:
Min 1Q Median 3Q Max
-371.30 -93.81 -23.00 60.94 838.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 314.2601 26.7228 11.76 <2e-16 ***
age 5.7783 0.4467 12.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 149.3 on 264 degrees of freedom
Multiple R-squared: 0.3879, Adjusted R-squared: 0.3856
F-statistic: 167.3 on 1 and 264 DF, p-value: < 2.2e-16
The last line in this summary output shows an F-statistic of
167.3. The F-statistic – and the quantities used to calculate this
statistic – can be obtained using the anova function.
anova(m_rt)
Analysis of Variance Table
Response: rt
Df Sum Sq Mean Sq F value Pr(>F)
age 1 3727069 3727069 167.29 < 2.2e-16 ***
Residuals 264 5881593 22279
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis that this F-test challenges is that the variance explained by all predictors in the model (i.e. age in our case) is simultaneously zero; equivalently, that all predictor coefficients are zero.
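Since there is only one predictor here, this null hypothesis is equivalent to the slope of age being zero, so the F-statistic equals the square of the t-value reported for age in the summary output. A quick check using the m_rt object above:
# t-value for age squared equals the F-statistic in a simple regression
coef(summary(m_rt))["age", "t value"]^2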
The p-value – i.e. the probability of observing the F-value above or a
larger one, assuming the null hypothesis above is true – can be
extracted (ignore the NA value in the output):
anova(m_rt)$`Pr(>F)`
[1] 5.697268e-30 NA
To turn off the scientific notation here you can do
options(scipen = 100, digits = 22)
which allows you to see the actual p-value. It is small. In other words, the observed F-statistic would be very unlikely if the null hypothesis were true.
anova(m_rt)$`Pr(>F)`
[1] 0.000000000000000000000000000005697268402077479818767
[2] NA
The pf function gives you the probability of observing an F-value of
167.29 or larger in an F-distribution with two degrees-of-freedom
parameters: one for the number of predictors (df1) and one for the
residual degrees of freedom, i.e. the sample size minus the number of
predictors minus one for the intercept (df2).
pf(167.29, df1 = 1, df2 = 264, lower.tail = FALSE)
[1] 0.000000000000000000000000000005701588596895334963549
In a simple linear regression (one predictor), the F-statistic (the value
of 167.29 above) compares how much of the variance in the rt
data is explained by the predictor vs. how much is left unexplained. If
the predictor explains a lot of variance relative to the “noise,” the F
value becomes large. The F value measures the ratio of explained
variance (by the regression) to unexplained variance (the residual
error), adjusted for degrees of freedom. So the intuition follows this
form:
\[
F = \frac{\text{variance explained by predictor}}{\text{variance left over}}
\]
A high F means the predictor explains much more variance than is left in the residuals. This is the same as obtaining the ratio of the Mean Squares (Mean Sq in the output above) of age and the Mean Squares of the residuals:
3727069 / 22279
[1] 167.290677319448803928
The Mean Squares are just the Sum of Squares (Sum Sq in
the output above) divided by the corresponding degrees of freedom, just like an
arithmetic mean is the sum of observations divided by the sample size. So
the Mean Squares of age is
3727069 / 1
[1] 3727069
and for the residuals
5881593 / 264
[1] 22278.76136363636396709
So the F-value and the significance test depend on the Sum of
Squares (the Sum Sq column). The bottom value in the Sum Sq column is the
residual sum of squares (RSS), i.e. the sum of the squared residuals, where
the residuals are the differences between the observed data and the model's
prediction for each data point. The RSS is therefore a measure of how much
variance wasn't explained by the model predictors, hence
(rss <- sum(residuals(m_rt)^2))
[1] 5881593.297219597734511
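The same quantity can also be obtained with R's deviance function, which for a linear model returns the residual sum of squares:
# RSS via the deviance function (identical to the sum above)
deviance(m_rt)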
The top value in the Sum Sq column is the explained sum
of squares (ESS), i.e. the sum of the squared differences between the model
predictions and the sample mean.
(ess <- sum((predict(m_rt) - mean(blomkvist$rt))^2))
[1] 3727069.032354341354221
The sum of ESS and RSS is the total sum of squares (TSS)
(tss <- ess + rss)
[1] 9608662.329573938623071
which can be calculated from the raw data.
sum((blomkvist$rt - mean(blomkvist$rt))^2)
[1] 9608662.329573938623071
Therefore, from the Sum Sq column in the anova
output above we can calculate the \(R^2\) value shown in the
summary output as
\[ R^2 = \frac{\text{ESS}}{\text{ESS} + \text{RSS}} = \frac{\text{ESS}}{\text{TSS}} \]
(r2 <- ess / tss)
[1] 0.3878863575924625939351
and therefore also the Adjusted \(R^2\)
– which can be found in the summary output – following
\[ R^2_\text{Adj} = 1 - (1 - R^2) \cdot \frac{n-1}{n-K-1}, \]
where \(n\) is the sample size (n = 266) and \(K\) is the number of predictors (i.e. 1 for age).
K <- 1
n <- nrow(blomkvist)
1 - (1 - r2) * (n - 1) / (n - K - 1)
[1] 0.3855677453106158836249
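Both values can also be extracted directly from the model summary, which is a convenient check on the calculations above:
# R^2 and adjusted R^2 as stored in the model summary
summary(m_rt)$r.squared
summary(m_rt)$adj.r.squared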
In a simple linear regression model the F-statistic is the ratio of ESS and RSS, each divided by its respective degrees of freedom:
\[ F = \frac{\text{ESS} / \text{df}_1}{\text{RSS}/\text{df}_2}. \]
You can think of the F value as the signal-to-noise ratio for the regression model where the signal is variance explained by the predictor and the noise is variance not explained (residuals).
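To make this concrete, here is a small check that reuses the ess, rss, n, and K objects computed above; the same statistic, together with its degrees of freedom, is also stored in the model summary.
df1 <- K          # number of predictors
df2 <- n - K - 1  # residual degrees of freedom
(ess / df1) / (rss / df2)
# the F-statistic and its degrees of freedom as stored in the summary
summary(m_rt)$fstatistic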
In situations where you have two different models, for example a simple linear regression (model 0) like the one above and a multiple linear regression model that has the same predictor as the simple linear regression and another predictor (model 1), this formula expands to
\[ F = \frac{(\text{RSS}_0 - \text{RSS}_1)/(\text{df}_0 - \text{df}_1)}{\text{RSS}_1/\text{df}_1}, \]
where the subscripts 0 and 1 indicate model 0 and model 1, so \(\text{RSS}_0\) is the residual sum of squares of model 0 and \(\text{RSS}_1\) is the residual sum of squares of model 1; their difference measures how much additional variance is explained by the more complex model, which corresponds to its extra explained sum of squares.
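As an illustration – without adding a second predictor to the data – we can treat the intercept-only model as model 0 and the age model above as model 1. The anova function applied to two nested models uses exactly this formula, and here it reproduces the F-statistic from the summary output; with a genuine second predictor, the same call applies.
# Model 0: intercept-only model, so RSS_0 equals the total sum of squares
m_0 <- lm(rt ~ 1, data = blomkvist)
# Compare the nested models: F = ((RSS_0 - RSS_1) / (df_0 - df_1)) / (RSS_1 / df_1)
anova(m_0, m_rt)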