The term “degrees of freedom” is based in pure math – how many parameters (degrees) in a given system of equations are “free” to take any value? Or more abstractly, “the dimension of the domain of a random vector” (1).
In statistics, the best way to think about degrees of freedom is as a measure that answers the question, “Do I have enough observations in my dataset to confidently make a statistical claim?”
In the simplest terms, we can think about the impact of degrees of freedom in a one-sample test, e.g. estimating a sample mean.
The general formula for determining degrees of freedom is \(n - p\), where \(n\) is the sample size and \(p\) is the number of independent parameters being estimated. For just the sample mean, \(p = 1\).
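As a quick check, R’s t.test() reports the degrees of freedom it uses, which for a single sample mean is \(n - 1\). A minimal sketch (the five measurements below are made up for illustration):

x <- c(5.1, 4.9, 4.7, 4.6, 5.0)  # five made-up measurements, so n = 5
t.test(x)$parameter              # df = 5 - 1 = 4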
The formula for a confidence interval using a t-distribution is \(\bar{x} \pm t \cdot (s / \sqrt{n})\), where \(s\) is the sample standard deviation. At first glance, degrees of freedom doesn’t appear in this equation. Don’t forget, however, that we also need to know the degrees of freedom, \(n - 1\), to look up the correct value of \(t\). Note: If you are thinking that you don’t need to know the degrees of freedom when using a z-distribution, e.g. the normal, this is correct! Degrees of freedom only comes into play for certain kinds of analyses, e.g. small samples suitable for a t-test, or a chi-square test of independence.
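To make the lookup concrete, here is that interval computed by hand with qt(), using assumed values for the sample statistics:

n <- 10                                 # assumed sample size
xbar <- 5.0                             # assumed sample mean
s <- 0.4                                # assumed sample standard deviation
t_crit <- qt(0.975, df = n - 1)         # a two-sided 95% CI uses the 0.975 quantile
xbar + c(-1, 1) * t_crit * s / sqrt(n)  # lower and upper bounds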
Let’s look at an example using R’s built-in dataset, iris, and see how degrees of freedom affects our confidence in estimating a sample mean. Imagine you are a botanist taking samples of the sepal lengths of mature irises. How many irises do you need to measure to be confident in your result?
library(dplyr)    # for %>%, mutate(), and sample_n() used later
library(ggplot2)

Ns <- 2:40         # sample sizes to try
set.seed(123)      # make the random draws reproducible
ci1 <- c()         # lower bounds of the confidence intervals
ci2 <- c()         # upper bounds
mean_samps <- c()  # sample means
for(n in Ns){
  samp <- sample(iris$Sepal.Length, n)            # draw n sepal lengths
  ci <- t.test(samp, conf.level = 0.95)$conf.int  # 95% CI around the sample mean
  ci1 <- append(ci1, ci[1])
  ci2 <- append(ci2, ci[2])
  mean_samps <- append(mean_samps, mean(samp))
}

df <- data.frame(mean_samps, Ns, ci1, ci2) %>%
  mutate(`Degrees of Freedom` = Ns - 1)  # df for a one-sample mean is n - 1

ggplot(df, aes(x = `Degrees of Freedom`, y = mean_samps)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymax = ci2, ymin = ci1)) +
  ylab("Sample Mean") +
  ggtitle("Impact of Degrees of Freedom on Confidence Interval for Sample Mean")
As you can see, the higher the degrees of freedom (which really just means the more measurements collected), the smaller the 95% confidence interval around our sample mean. In other words, the higher our sample size, the more sure we are that our sample mean is really close to the population mean.
Note, too, however, that this effect hits diminishing returns pretty quickly; this is why many textbooks treat any sample greater than 30 as “large”.
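You can see the same diminishing returns in the t critical values themselves: by around 30 degrees of freedom they are already close to the normal (z) value.

qt(0.975, df = c(5, 10, 30, 100))  # 2.57, 2.23, 2.04, 1.98
qnorm(0.975)                       # 1.96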
In the context of linear regression, degrees of freedom measures the relationship between how many observations are in your sample (\(n\)) and how many predictors you would like to include in your regression (\(p\)).
Simply put, the more data you have, the more predictors you can include in your regression analysis. On the flip side, the fewer predictors you are trying to evaluate, the less data you need to confidently make a statistical claim.
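For a model with an intercept, the residual degrees of freedom work out to \(n - p - 1\): one slot per predictor plus one for the intercept. A minimal sketch using R’s built-in mtcars data, which we will lean on below:

fit <- lm(mpg ~ wt + hp, data = mtcars)  # 2 predictors + intercept = 3 estimated terms
df.residual(fit)                         # 32 - 3 = 29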
Let’s take a look at the mtcars dataset, and see how these factors affect a simple linear regression.
Imagine you are working at a car company (but know more about statistics than about cars!) and are trying to figure out what factors impact the mileage a car gets in order to maximize features that your customers want while minimizing fuel consumption.
How many predictors can we use given our limited data set? mtcars has 32 observations with 11 features.
Ns <- 2:32
p_values <- c()
set.seed(123)  # assumed seed for reproducibility; the original draw was unseeded
for(n in Ns){
  # fit the full model (10 predictors + intercept = 11 terms) to a random subsample of n cars
  fit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
            sample_n(mtcars, n))
  f <- summary(fit)$fstatistic  # F value, numerator df, denominator df
  p_values <- append(p_values, pf(f[1], f[2], f[3], lower.tail = FALSE))
}

df2 <- data.frame(Ns, p_values)

ggplot(df2, aes(x = Ns, y = p_values)) +
  geom_point(size = 4) +
  geom_vline(xintercept = 12, linetype = "dotted", color = "red") +
  geom_hline(yintercept = 0.05, linetype = "dotted", color = "blue") +
  annotate(geom = "text",
           x = 8, y = .2,
           label = "12 Cases Required to Fit 11-Term Model",
           color = "red") +
  annotate(geom = "text",
           x = 5, y = 0.06,
           label = "95% Confidence Level",
           color = "blue") +
  ylab("Model p-value") +
  ggtitle("Impact of Degrees of Freedom on Confidence Level for 11-Term Linear Regression Model")
## Warning: Removed 10 rows containing missing values (geom_point).
In the example above, we can see that a minimum of 12 cases is needed to evaluate the regression at all: the model estimates 11 terms (10 predictors plus the intercept), so with 11 or fewer cases the residual degrees of freedom are 0 and no p-value can be computed (those are the 10 missing points the warning above refers to). More cases result in a higher level of confidence that we can reject the null hypothesis that there is no relationship between the terms in the model and the mileage of the cars in our dataset.
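To see the boundary case directly, here is a sketch fitting the full model to exactly 11 cases (which rows get drawn is arbitrary):

set.seed(123)
tiny <- sample_n(mtcars, 11)  # as many cases as estimated terms
fit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, tiny)
df.residual(fit)  # n minus the model's rank: typically 0 here, leaving nothing for inference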
In the example below, we can see that a simpler model allows us to draw robust conclusions with much less data: with one predictor plus the intercept (a 2-term model), a p-value can be computed from as few as 3 cases.
Ns <- 2:32
p_values <- c()
set.seed(123)  # assumed seed, as above
for(n in Ns){
  fit <- lm(mpg ~ cyl, sample_n(mtcars, n))  # one predictor plus intercept = 2 terms
  f <- summary(fit)$fstatistic
  p_values <- append(p_values, pf(f[1], f[2], f[3], lower.tail = FALSE))
}

df2 <- data.frame(Ns, p_values)

ggplot(df2, aes(x = Ns, y = p_values)) +
  geom_point(size = 4) +
  geom_vline(xintercept = 3, linetype = "dotted", color = "red") +
  geom_hline(yintercept = 0.05, linetype = "dotted", color = "blue") +
  annotate(geom = "text",
           x = 8, y = .2,
           label = "3 Cases Required to Fit 2-Term Model",
           color = "red") +
  annotate(geom = "text",
           x = 5, y = 0.06,
           label = "95% Confidence Level",
           color = "blue") +
  ylab("Model p-value") +
  ggtitle("Impact of Degrees of Freedom on Confidence Level for Two-Term Linear Regression Model")
In practice, degrees of freedom is a measure of how complex a model we can fit to a given data set, or, looked at another way, how confident we can be in our estimates given the data we have and the terms in our model. More is better, but there are diminishing returns beyond a certain point.
(1) “What Are Degrees of Freedom in Statistics?” Minitab Blog. https://blog.minitab.com/en/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics
(2) “Degrees of Freedom in Statistics” by Jim Frost. https://statisticsbyjim.com/hypothesis-testing/degrees-freedom-statistics/