2026-06-06

Linear Regression

Linear regression models the relationship between a response variable and one or more predictor variables. With a single predictor, the model is:

\(y = \beta_0 + \beta_1 x + \varepsilon\)

Here \(y\) is the variable we want to predict, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the random error.

Setting up a linear regression model in R is relatively simple. The code looks like:

mod <- lm(your_y_variable ~ your_x_variable, data = your_dataset)

Our Dataset

We’ll use the bodyfat dataset from the mfp package, which provides body measurements from 252 men. The columns we care about are:

  • siri: body fat %, derived from underwater weighing/density (our \(y\))
  • abdomen: abdominal circumference (our first predictor)
  • chest: chest circumference (our second predictor)
  • hip: hip circumference (our third predictor)
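
A minimal sketch of loading the data, assuming the mfp package is already installed. The `cols` helper is a name introduced here for illustration only:

```r
# Load the bodyfat data shipped with the mfp package
# (assumption: mfp is installed, e.g. via install.packages("mfp"))
data(bodyfat, package = "mfp")

# Hypothetical helper: the four columns used in this walkthrough
cols <- c("siri", "abdomen", "chest", "hip")

# Inspect the structure of just those columns
str(bodyfat[, cols])
```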

Example One

How well does abdominal circumference predict body fat %?

First we will draft a scatter plot to see if there is an overall trend.
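
One way such a scatter plot could be drawn is sketched below, assuming ggplot2 and mfp are installed. The object name `p_scatter`, the axis labels, and the point transparency are illustrative choices, not the original plot's settings:

```r
library(ggplot2)
data(bodyfat, package = "mfp")

# Scatter plot of body fat % against abdominal circumference
p_scatter <- ggplot(bodyfat, aes(x = abdomen, y = siri)) +
  geom_point(alpha = 0.6) +  # slight transparency so overlapping points stay visible
  labs(x = "Abdominal circumference (cm)", y = "Body fat % (siri)")

p_scatter
```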

Fitting a Line

Since we can visually see a positive trend in the scatter plot but cannot visually quantify it accurately, we fit a line using ggplot.
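
A sketch of adding a fitted line with ggplot2, assuming the same packages as before. `geom_smooth(method = "lm")` overlays the least-squares line; the object name `p_fit` and the labels are illustrative:

```r
library(ggplot2)
data(bodyfat, package = "mfp")

p_fit <- ggplot(bodyfat, aes(x = abdomen, y = siri)) +
  geom_point(alpha = 0.6) +
  # method = "lm" fits siri ~ abdomen by least squares;
  # se = TRUE shades a confidence band around the line
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Abdominal circumference (cm)", y = "Body fat % (siri)")

p_fit
```

The line drawn this way is the same one `lm(siri ~ abdomen, data = bodyfat)` estimates, so the plot and the model summary describe the same fit.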

The output

mod <- lm(siri ~ abdomen, data = bodyfat)
summary(mod)
## 
## Call:
## lm(formula = siri ~ abdomen, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.0160  -3.7557   0.0554   3.4215  12.9007 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.28018    2.66034  -14.77   <2e-16 ***
## abdomen       0.63130    0.02855   22.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.877 on 250 degrees of freedom
## Multiple R-squared:  0.6617, Adjusted R-squared:  0.6603 
## F-statistic: 488.9 on 1 and 250 DF,  p-value: < 2.2e-16

Statistic           Value
Estimate (Slope)    0.6313
R-Squared           0.6617
p-value             < 0.0001


Intercept: The intercept estimate is \(\beta_0\) in the equation. It means that when abdomen is \(0\) cm, the model predicts a siri of about -39.28%. Since no one has an abdominal circumference of \(0\) cm and negative body fat is impossible, the intercept has no real-world meaning for our purposes; it only positions the line.

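Estimate (Slope): The abdomen estimate is the slope of our function, or \(\beta_1\) in the equation. In our case it is 0.6313, which means each additional centimeter of abdominal circumference is associated with a predicted increase of about 0.63 percentage points of body fat. The positive sign indicates a positive relationship.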
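
As a worked illustration using the rounded coefficients from the output above, take a hypothetical abdominal circumference of 100 cm (a value chosen here only for the arithmetic):

\(\hat{y} = -39.28 + 0.6313 \times 100 \approx 23.85\)

So the model would predict a body fat of roughly 23.85% for that measurement, and about 0.63 percentage points more for 101 cm.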

R-Squared: \(R^2\) tells us what fraction of the variation in body fat % our model explains. For example, an \(R^2\) of \(0.66\) tells us that the model explains \(66\%\) of the variation in body fat %. The remaining \(34\%\) is unexplained: it may come from variables not in the model, measurement error, or random variation.

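p-value: A value between 0 and 1 giving the probability of seeing a relationship at least this strong if there were truly no linear relationship between the variables. The closer it is to 0, the harder the result is to attribute to chance. A common threshold is a p-value below \(0.05\). Our value of \(9.09 \times 10^{-61}\) is far below that, so the relationship between abdomen and body fat is very unlikely to be due to chance alone.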
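
The individual statistics can be pulled out of the fitted model rather than read off the printout. A minimal sketch, assuming mfp is installed; the variable names `coefs`, `slope`, `p_value`, and `r_sq` are illustrative:

```r
data(bodyfat, package = "mfp")
mod <- lm(siri ~ abdomen, data = bodyfat)

# Coefficient matrix: one row per term, with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coefs <- coef(summary(mod))

slope   <- coefs["abdomen", "Estimate"]
p_value <- coefs["abdomen", "Pr(>|t|)"]
r_sq    <- summary(mod)$r.squared

# format() switches to scientific notation for very small numbers,
# so the p-value is not rounded down to 0
format(p_value, digits = 3)
round(c(slope = slope, r_squared = r_sq), 4)
```

Printing the p-value in scientific notation avoids the misleading impression of an exact zero that fixed-decimal rounding gives.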

Understanding R-Squared

The Math:

\(R^2 = 1 - \frac{RSS}{TSS}\)

RSS is the residual sum of squares, or the sum of the squared differences between the observed and predicted values:

\(\sum(y_i - \hat{y}_i)^2\)

TSS is the total sum of squares, or the sum of squared differences between the observed values and their mean:

\(\sum(y_i - \bar{y})^2\)
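
The formula can be checked by hand against what `lm()` reports. A minimal sketch for the single-predictor model, assuming mfp is installed; `rss`, `tss`, and `r_sq_manual` are names introduced here:

```r
data(bodyfat, package = "mfp")
mod <- lm(siri ~ abdomen, data = bodyfat)

y     <- bodyfat$siri   # observed values
y_hat <- fitted(mod)    # predicted values from the model

rss <- sum((y - y_hat)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares

r_sq_manual <- 1 - rss / tss

# Compare with the value reported by summary()
all.equal(r_sq_manual, summary(mod)$r.squared)
```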

The issue: \(R^2\) never decreases when a predictor is added, even if that predictor is unrelated to \(y\), so on its own it can’t tell us whether a new predictor actually helped.

Introducing Adjusted R-Squared

The Math:

Adjusted \(R^2 = 1-(\frac{(1-R^2)(n-1)}{n-p-1})\)

Where \(n\) is the number of observations and \(p\) is the number of predictors. The adjustment penalizes each added predictor, so adjusted \(R^2\) rises only when a new predictor improves the fit by more than the penalty costs. This lets us judge whether a variable earns its place in the model, rather than throwing in every variable without thought or purpose.
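
The formula is short enough to write as a function. A minimal sketch; `adj_r2` is a hypothetical helper defined here, not a base R function, and the inputs are the rounded values from the single-predictor output (\(R^2 = 0.6617\), \(n = 252\), \(p = 1\)):

```r
# Hypothetical helper implementing the adjusted R-squared formula
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Single-predictor model: 252 observations, 1 predictor
round(adj_r2(r2 = 0.6617, n = 252, p = 1), 4)
```

This reproduces the Adjusted R-squared of 0.6603 shown in the summary output. Because the input \(R^2\) is itself rounded, results for other models can differ from R's output in the last decimal place.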

Example Two

Hypothesis: The more body circumferences we add to the model, the more accurately we can predict the siri body-fat percentage.

Process: With a single predictor we fit a line; with two we fit a plane. Beyond that we can no longer visualize the fit, but the math works the same way.
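
In symbols, the single-predictor equation extends by adding one slope term per predictor. The indexed notation \(x_1, \dots, x_p\) is introduced here for illustration:

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon\)

For the largest model in this example, \(x_1\), \(x_2\), and \(x_3\) would be abdomen, chest, and hip.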

Example Two Output

First we look at \(R^2\) alone:

mod_two <- lm(siri ~ abdomen + chest, data = bodyfat)
mod_three <- lm(siri ~ abdomen + chest + hip, data = bodyfat)

Model                    R-Squared
abdomen                  0.6617
abdomen + chest          0.6728
abdomen + chest + hip    0.6993

How does this look when viewing Adjusted R-Squared?

Since we knew that \(R^2\) would increase as we added variables, let’s look at how adjusted \(R^2\) compares:

Model                    R-Squared    Adjusted R-Squared
abdomen                  0.6617       0.6603
abdomen + chest          0.6728       0.6702
abdomen + chest + hip    0.6993       0.6956
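
A comparison table like this can be assembled directly from the fitted models. A minimal sketch, assuming mfp is installed; the `models` list, the `comparison` data frame, and its column names are illustrative and not necessarily how the original table was built:

```r
data(bodyfat, package = "mfp")

# Fit the three nested models, named by their predictors
models <- list(
  "abdomen"               = lm(siri ~ abdomen, data = bodyfat),
  "abdomen + chest"       = lm(siri ~ abdomen + chest, data = bodyfat),
  "abdomen + chest + hip" = lm(siri ~ abdomen + chest + hip, data = bodyfat)
)

# Pull R-squared and adjusted R-squared out of each model summary
comparison <- data.frame(
  Model         = names(models),
  R_squared     = sapply(models, function(m) summary(m)$r.squared),
  Adj_R_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names     = NULL
)

comparison$R_squared     <- round(comparison$R_squared, 4)
comparison$Adj_R_squared <- round(comparison$Adj_R_squared, 4)
print(comparison)
```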

Conclusions

Adjusted \(R^2\) stays close to \(R^2\) and still rises with each added predictor (0.6603, 0.6702, 0.6956), so chest and hip each improve the fit by more than the penalty for adding them costs. This supports the hypothesis: in this dataset, each added circumference brings the prediction a little closer to the costly density-based measurement.

A note: most of the roughly 70% of variation explained by the three-predictor model comes from the abdominal measurement, which alone explains about 66%; chest and hip together add less than 4 percentage points.