2025-03-15

Plotting Girth and Volume

Let’s say we want to find the correlation between a tree’s girth and its volume.

Because there appears to be a fairly strong positive correlation between girth and volume, these data points can be used to predict the volume of a tree from its girth.
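As a rough numerical check on that visual impression, the correlation can be computed directly. This is a minimal sketch that assumes `trees` is R's built-in dataset with `Girth` and `Volume` columns; the variable name `girth_volume_cor` is only illustrative.

```r
# Assumption: `trees` is R's built-in dataset (columns Girth, Height, Volume).
data(trees)

# Pearson correlation between girth and volume; values close to 1 indicate
# a strong positive linear relationship.
girth_volume_cor <- cor(trees$Girth, trees$Volume)
print(girth_volume_cor)
```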

Regression Line Equation

We use a regression line to make a prediction of the volume (y-hat) using the given features (x variables).

\[ \hat{y} = m_1 x_1 + \dots + m_n x_n + b \] (In the case of having only one feature, the equation is \(\hat{y} = m x + b\))
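To make the formula concrete, here is a small sketch that evaluates it for two features. The coefficients and inputs are hypothetical values chosen only for illustration; they are not fitted to any data.

```r
# Hypothetical coefficients and inputs, chosen only to illustrate the formula.
m <- c(2, 0.5)          # slopes m_1 and m_2
b <- 1                  # intercept
X <- rbind(c(1, 4),     # each row is one observation (x_1, x_2)
           c(3, 2))

# y-hat = m_1 * x_1 + m_2 * x_2 + b, computed for every row at once.
y_hat <- as.vector(X %*% m + b)
print(y_hat)  # 5 8
```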

Error Function

To determine which equation creates the “line of best fit”, we first need an error function that measures how well each candidate line fits the data.

\[ \sum_{i=1}^j (\hat{y}_i - y_i)^2 \]

(Here \(j\) is the number of data points. The line of best fit is the one that minimizes the error function, which in this case is the sum of squared errors.)
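The sketch below shows how this error function can be used to compare two lines. It assumes `trees` is R's built-in dataset; the helper `sse()` and the hand-picked candidate line (slope 4, intercept -25) are hypothetical and exist only for this comparison.

```r
# Assumption: `trees` is R's built-in dataset.
data(trees)

# Sum of squared errors for the line y-hat = m * x + b.
sse <- function(m, b, x, y) {
  y_hat <- m * x + b
  sum((y_hat - y)^2)
}

# Hypothetical candidate line, picked by eye.
sse_guess <- sse(m = 4, b = -25, x = trees$Girth, y = trees$Volume)

# Least-squares line found by lm().
fit <- lm(Volume ~ Girth, data = trees)
sse_fit <- sse(m = coef(fit)[["Girth"]], b = coef(fit)[["(Intercept)"]],
               x = trees$Girth, y = trees$Volume)

# The fitted line should never have a larger error than any other line.
print(c(guess = sse_guess, fitted = sse_fit))
```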

Line of Best Fit Using ggplot2

library(ggplot2)
# Scatter plot of girth vs. volume with a least-squares line (method="lm")
ggplot(data=trees, aes(x=Girth, y=Volume)) +
  geom_point(size=3) + geom_smooth(method="lm", formula=y~x)

Line of Best Fit Using plotly

library(plotly)
# Fit volume on girth, then overlay the fitted values on the scatter plot
mod <- lm(Volume ~ Girth, data=trees)
plot_ly(x=trees$Girth, y=trees$Volume, type="scatter", mode="markers") %>%
  add_lines(x=trees$Girth, y=fitted(mod), name="fitted")

Predicting Volumes Based on Girth

We can then use the regression model to predict the volume for a given girth value. For example, for three trees with girths of 10, 12, and 15 units, the model predicts:

  • Girth: 10 => Volume: 13.715
  • Girth: 12 => Volume: 23.847
  • Girth: 15 => Volume: 39.044
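One way these predictions could be reproduced is with `predict()`. This sketch assumes `trees` is R's built-in dataset and refits the model with `data = trees` so that `newdata` is matched by column name; `new_trees` and `predicted` are illustrative names.

```r
# Assumption: `trees` is R's built-in dataset.
data(trees)

# Fit volume as a linear function of girth.
fit <- lm(Volume ~ Girth, data = trees)

# Girths to predict for; the column name must match the model's predictor.
new_trees <- data.frame(Girth = c(10, 12, 15))

predicted <- predict(fit, newdata = new_trees)
print(round(predicted, 3))
```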

Multi-Dimensional Regression

The previous regression example accounts for only one independent variable, but the same approach extends to any number of features; each additional feature adds a dimension to the model. This scatter plot shows how both girth and height affect volume.
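A two-feature model can be fitted the same way as the single-feature one. The sketch below assumes `trees` is R's built-in dataset with a `Height` column; the example tree (girth 12, height 75) is a hypothetical input.

```r
# Assumption: `trees` is R's built-in dataset (columns Girth, Height, Volume).
data(trees)

# Fit volume on both girth and height:
# y-hat = m_1 * Girth + m_2 * Height + b
fit2 <- lm(Volume ~ Girth + Height, data = trees)
print(coef(fit2))

# Predict the volume of one hypothetical tree.
new_tree <- data.frame(Girth = 12, Height = 75)
predicted2 <- predict(fit2, newdata = new_tree)
print(predicted2)
```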