This document gives an overview of how the F-statistic in a (simple and multiple) linear model is calculated.

Consider the following data set:

# Load packages (read_csv, select and drop_na come from the tidyverse)
library(tidyverse)
# Load data
blomkvist <- read_csv("blomkvist.csv")
# Select variables needed
blomkvist <- select(blomkvist, id, age, rt = rt_hand_d)
# Remove rows with missing data
blomkvist <- drop_na(blomkvist)

Say we assume that reaction time rt depends on age. Using a linear regression, fitted with lm, we can model such a relationship.

m_rt <- lm(rt ~ age, data = blomkvist)

The inferential summary of this model can be obtained using the summary function applied to the model object m_rt.

summary(m_rt)

Call:
lm(formula = rt ~ age, data = blomkvist)

Residuals:
    Min      1Q  Median      3Q     Max 
-371.30  -93.81  -23.00   60.94  838.70 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 314.2601    26.7228   11.76   <2e-16 ***
age           5.7783     0.4467   12.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 149.3 on 264 degrees of freedom
Multiple R-squared:  0.3879,    Adjusted R-squared:  0.3856 
F-statistic: 167.3 on 1 and 264 DF,  p-value: < 2.2e-16

The last line in this summary output shows that the F-statistic is 167.3. The F-statistic – and the metrics used to calculate it – can be obtained using the anova function.

anova(m_rt)
Analysis of Variance Table

Response: rt
           Df  Sum Sq Mean Sq F value    Pr(>F)    
age         1 3727069 3727069  167.29 < 2.2e-16 ***
Residuals 264 5881593   22279                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis that this F-test evaluates is that the variance explained by all predictors in the model (i.e. age in our case) is nil – equivalently, that their coefficients are simultaneously zero.

The p-value – the probability of observing the F-value above, or a larger one, assuming the null hypothesis is true – can be extracted (ignore the NA value in the output):

anova(m_rt)$`Pr(>F)`
[1] 5.697268e-30           NA

To turn off the scientific notation here you can do

options(scipen = 100, digits = 22)

which allows you to see the actual p-value. It’s small. In other words, an F statistic this large would be very unlikely if the null hypothesis were true.

anova(m_rt)$`Pr(>F)`
[1] 0.000000000000000000000000000005697268402077479818767
[2]                                                    NA

The pf function gives you the probability of observing an F value of 167.29 or larger in an F distribution with two degrees of freedom: df1, the number of predictors, and df2, the sample size minus the number of predictors minus 1 (here 266 - 1 - 1 = 264).

pf(167.29, df1 = 1, df2 = 264, lower.tail = FALSE)
[1] 0.000000000000000000000000000005701588596895334963549

In a simple linear regression (one predictor), the F-statistic (the value 167.29 above) compares how much of the variance in the rt data is explained by the predictor vs. how much is left unexplained. If the predictor explains a lot of variance relative to the “noise,” the F value becomes large. The F value measures the ratio of explained variance (by the regression) to unexplained variance (the residual error), adjusted for degrees of freedom. So the intuition follows this form:

\[ F = \frac{\text{variance explained by predictor}}{\text{variance left over}} \] A high F means the predictor explains much more variance than is left in the residuals. This is the same as taking the ratio of the Mean Squares (Mean Sq in the output above) of age to the Mean Squares of the residuals:

3727069 / 22279
[1] 167.290677319448803928

The Mean Squares are just the Sum of Squares (Sum Sq in the output above) divided by the corresponding degrees of freedom, just as an arithmetic mean is the sum of observations divided by the sample size. So the Mean Square for age is

3727069 / 1
[1] 3727069

and for the residuals

5881593 / 264
[1] 22278.76136363636396709

So the F value and the significance test depend on the Sum of Squares (Sum Sq column). The bottom value in this column is the residual sum of squares (RSS): the sum of the squared residuals, where a residual is the difference between an observed data point and the model’s prediction for it. The RSS is therefore a measure of how much variance was not explained by the model predictors, hence

(rss <- sum(residuals(m_rt)^2))
[1] 5881593.297219597734511
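
As a check, the same RSS can be computed directly from this definition of a residual, i.e. as the observed values minus the model’s fitted values – a minimal sketch reusing the blomkvist data and m_rt from above, which reproduces the value just printed:

# Residuals by hand: observed rt minus fitted (predicted) values
sum((blomkvist$rt - fitted(m_rt))^2)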

The top value in the Sum Sq column is the explained sum of squares (ESS): the sum of the squared differences between the model predictions and the sample mean.

(ess <- sum((predict(m_rt) - mean(blomkvist$rt))^2))
[1] 3727069.032354341354221

The sum of ESS and RSS is the total sum of squares (TSS)

(tss <- ess + rss)
[1] 9608662.329573938623071

which can be calculated from the raw data.

sum((blomkvist$rt - mean(blomkvist$rt))^2)
[1] 9608662.329573938623071

Therefore, from the Sum Sq column in the anova output above we can calculate the \(R^2\) value shown in the summary output as

\[ R^2 = \frac{\text{ESS}}{\text{ESS} + \text{RSS}} = \frac{\text{ESS}}{\text{TSS}} \]

(r2 <- ess / tss)
[1] 0.3878863575924625939351

and therefore also the Adjusted \(R^2\) – which can be found in the summary output – following

\[ R^2_\text{Adj} = 1 - (1 - R^2) \cdot \frac{n-1}{n-K-1}, \]

where \(n\) is the sample size (n = 266) and \(K\) is the number of predictors (i.e. 1 for age).

K <- 1
n <- nrow(blomkvist)
1 - (1 - r2) * (n - 1) / (n - K - 1)
[1] 0.3855677453106158836249

In a simple linear regression model the F statistic is the ratio of ESS and RSS, each divided by its respective degrees of freedom:

\[ F = \frac{\text{ESS} / \text{df}_1}{\text{RSS}/\text{df}_2}. \]

You can think of the F value as the signal-to-noise ratio for the regression model where the signal is variance explained by the predictor and the noise is variance not explained (residuals).
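
Using the ESS and RSS computed above we can verify this directly. With one predictor, \(\text{df}_1 = K = 1\) and \(\text{df}_2 = n - K - 1 = 264\), and the ratio reproduces the F value of about 167.29 reported by summary and anova (a small sketch reusing the objects defined earlier):

df1 <- K          # degrees of freedom of the predictor(s)
df2 <- n - K - 1  # residual degrees of freedom
(ess / df1) / (rss / df2)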

In situations where you have two nested models – for example a simple linear regression (model 0) like the one above and a multiple linear regression that contains the same predictor plus an additional one (model 1) – this formula expands to

\[ F = \frac{(\text{RSS}_0 - \text{RSS}_1)/(\text{df}_0 - \text{df}_1)}{\text{RSS}_1/\text{df}_1}, \]

where the subscripts 0 and 1 indicate model 0 and model 1, so \(\text{RSS}_0\) is the residual sum of squares of model 0 and \(\text{RSS}_1\) is the residual sum of squares of model 1; their difference is the additional variance explained by the more complex model, i.e. its ESS.
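
A minimal sketch of such a model comparison in R uses the two-argument form of anova. Because only age was kept as a predictor in the data above, the sketch below takes an intercept-only model as model 0 and m_rt as model 1 purely for illustration; by the formula above this comparison reproduces the same F value of about 167.29, since the difference in RSS between the two models is exactly the ESS of m_rt.

# Model 0: intercept only; Model 1: the simple regression fitted above
m_0 <- lm(rt ~ 1, data = blomkvist)
anova(m_0, m_rt)

# The same F value computed by hand from the formula above
rss_0 <- sum(residuals(m_0)^2)   # equals the TSS computed earlier
rss_1 <- sum(residuals(m_rt)^2)
df_0 <- df.residual(m_0)         # n - 1 = 265
df_1 <- df.residual(m_rt)        # n - K - 1 = 264
((rss_0 - rss_1) / (df_0 - df_1)) / (rss_1 / df_1)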