March 16, 2025

What is Linear Regression?

Simple linear regression is a statistical method that models the relationship between:

  • A dependent variable (Y)
  • An independent variable (X)

It assumes a linear relationship between these variables:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Where:

  • \(\beta_0\) = intercept
  • \(\beta_1\) = slope
  • \(\varepsilon\) = error term (noise)

The Mathematical Foundation

The linear regression model estimates parameters by minimizing the sum of squared errors (SSE):

\[SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2\]

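Setting the partial derivatives of SSE with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to zero gives the closed-form parameter estimates: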

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
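
As a rough check of the formulas above, the sketch below computes the estimates by hand for the built-in trees data, taking Girth as X and Volume as Y (that choice of variables, and the names `beta0_hat`, `beta1_hat`, and `sse`, are assumptions made for illustration), and compares them with `lm()`.

```r
# Closed-form least-squares estimates for Volume ~ Girth on the built-in trees data
data(trees)
x <- trees$Girth
y <- trees$Volume

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: chosen so the fitted line passes through (mean(x), mean(y))
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Sum of squared errors evaluated at the estimates
y_hat <- beta0_hat + beta1_hat * x
sse <- sum((y - y_hat)^2)

# lm() should agree with the hand computation up to floating-point error
fit <- lm(Volume ~ Girth, data = trees)
all.equal(unname(coef(fit)), c(beta0_hat, beta1_hat))
all.equal(deviance(fit), sse)
```

Both comparisons should report agreement, since `lm()` minimizes the same SSE criterion.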

Example Dataset: Trees

data(trees)
head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
summary(trees)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

Visualizing the Data

library(ggplot2)

# Scatter plot of volume against girth
p <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "#8C1D40", size = 2, alpha = 0.7) +
  labs(title = "Tree Volume vs Girth",
       x = "Girth (inches)",
       y = "Volume (cubic feet)") +
  theme_minimal()
p

Fitting a Linear Model
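
A minimal sketch of fitting the simple model, assuming Volume is regressed on Girth as in the scatter plot. The object name `simple_model` is hypothetical, and base graphics are used here only to keep the example self-contained.

```r
data(trees)

# Simple linear regression of Volume on Girth
simple_model <- lm(Volume ~ Girth, data = trees)

coef(simple_model)               # estimated intercept and slope
summary(simple_model)$r.squared  # proportion of variance explained

# Scatter plot with the fitted regression line overlaid
plot(Volume ~ Girth, data = trees,
     xlab = "Girth (inches)", ylab = "Volume (cubic feet)",
     main = "Tree Volume vs Girth")
abline(simple_model, col = "#8C1D40", lwd = 2)
```

The slope estimates the change in volume (cubic feet) per additional inch of girth.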

3D Visualization with Multiple Predictors

Multiple Linear Regression

When we have multiple predictor variables, the model becomes:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon\]

For our trees data with two predictors (Girth and Height):

\[\text{Volume} = \beta_0 + \beta_1 \times \text{Girth} + \beta_2 \times \text{Height} + \varepsilon\]

Multiple Linear Regression Summary

multi_model <- lm(Volume ~ Girth + Height, data = trees)
summary(multi_model)
## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
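
To illustrate how the fitted coefficients are used, the sketch below predicts the volume of a hypothetical tree. The girth of 12 inches and height of 75 feet are made-up values, not taken from the data, and `new_tree`, `pred`, and `manual` are illustrative names.

```r
data(trees)
multi_model <- lm(Volume ~ Girth + Height, data = trees)

# Hypothetical new tree (values chosen only for illustration)
new_tree <- data.frame(Girth = 12, Height = 75)

# Point prediction with a 95% prediction interval
pred <- predict(multi_model, newdata = new_tree,
                interval = "prediction", level = 0.95)
pred

# The same point prediction, computed by hand from the coefficients
b <- coef(multi_model)
manual <- b[["(Intercept)"]] + b[["Girth"]] * 12 + b[["Height"]] * 75
manual

# 95% confidence intervals for the coefficients
confint(multi_model)
```

With the rounded coefficients from the summary above, the hand calculation is -57.9877 + 4.7082 × 12 + 0.3393 × 75, which comes to roughly 24 cubic feet.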

Residual Analysis
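
One possible way to examine the residuals is sketched below. It assumes the two-predictor model from the previous section and uses base graphics; the original plots may have been produced differently.

```r
data(trees)
multi_model <- lm(Volume ~ Girth + Height, data = trees)

res <- residuals(multi_model)
fit_vals <- fitted(multi_model)

# Residuals vs fitted values: curvature or a funnel shape would suggest
# a missed nonlinearity or non-constant variance
plot(fit_vals, res,
     xlab = "Fitted volume (cubic feet)", ylab = "Residual",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

# With an intercept in the model, least-squares residuals sum to zero
# up to floating-point error
sum(res)
```

The range of `res` should match the Min and Max values in the Residuals block of the model summary.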

The Code Behind the Visualizations

Here’s the code used to create the 3D visualization:

library(plotly)

# Interactive 3D scatter plot; marker color encodes Volume
plot_ly(trees, x = ~Girth, y = ~Height, z = ~Volume,
        type = "scatter3d", mode = "markers",
        marker = list(size = 5, color = ~Volume,
                      colorscale = "Viridis",
                      opacity = 0.8)) %>%
  layout(scene = list(
           xaxis = list(title = "Girth (inches)"),
           yaxis = list(title = "Height (feet)"),
           zaxis = list(title = "Volume (cubic feet)")
         ),
         title = "3D Relationship Between Tree Measurements")

Conclusion and Application

Linear regression is widely used in fields such as finance, economics, medicine, and ecology.

In our example with trees, we demonstrated that tree volume can be predicted with reasonable accuracy from girth alone (R² ≈ 0.94), and that adding height as a second predictor raises R² to 0.948 (adjusted R² = 0.944).