2025-04-09

What’s The Connection?

When looking at a data set, it may be difficult to tell immediately whether there are relationships between different variables. Even if you can spot a trend and point it out, how accurately could you describe it mathematically? Could you give a proper estimate of what the data would look like in between your recorded plot points? What about points beyond the limits of your data?

If only there was a way to describe data in this way, properly identifying the relationship between multiple different variables, a model which could form a line between yourself and the truth…

Linear Regression

Linear regression is a statistical model that identifies the relationship between a dependent variable and one or more independent variables. It lets researchers estimate values that a finite data set does not contain, both between the recorded points and beyond them.

The mathematical relationship is defined with the following equation:

\[ y = mx + b \]

where:

  • y is the expected value of the dependent variable

  • x is the value of the independent variable

  • m is the slope: the change in y for each one-unit change in x

  • b is the intercept: the expected value of y when x is 0
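As a minimal sketch of how these terms map onto R, the snippet below fits a line to the built-in iris data and reads off the slope and intercept. The variable names (fit, m, b) are chosen here for illustration only.

```r
# Fit petal length as a linear function of petal width (built-in iris data).
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# coef() returns a named vector holding the intercept (b) and the slope (m).
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["Petal.Width"]]

# Print the fitted equation in the form y = m * x + b.
cat("y =", round(m, 3), "* x +", round(b, 3), "\n")
```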

Example: Length & Width of Petals

For example, let's look at the data below! How can you reconcile petals with the same width but different lengths? What about estimating the length of petals with a width between 0.5 and 1.0?

…Ta-da! See how a linear regression model can take vague or missing data and enlighten us? A collection of data points has now become knowledge.
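The question about widths between 0.5 and 1.0 can be answered with predict(). This is a rough sketch, assuming the data in question is R's built-in iris set (used later in these slides); the widths 0.6, 0.75 and 0.9 are arbitrary values picked for illustration.

```r
# Fit petal length against petal width.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Widths inside the gap we want estimates for (arbitrary example values).
new_widths <- data.frame(Petal.Width = c(0.6, 0.75, 0.9))

# predict() evaluates the fitted line at each new width.
estimated_length <- predict(fit, newdata = new_widths)
print(cbind(new_widths, estimated_length))
```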

Least Squares

The linear regression model uses a straight line chosen to match the data as closely as possible. A single line can rarely pass exactly through every data point, so the goal is to find the line with the smallest overall error. The ‘Least Squares’ method is commonly used to determine this line.

\[ S = \sum\limits_{i = 1}^n r_i^2 \]

where n is the number of points in the data set, and each r is the vertical distance between a data point and the candidate line (its residual). The smaller the sum S, the better the line fits the data; the least-squares line is the one that makes S as small as possible.
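The sum S can be computed directly. The sketch below uses a helper named sum_sq_residuals (a hypothetical name, not a built-in function) and the standard closed-form estimates for a single independent variable, then compares the best line with a deliberately tilted one.

```r
# Sum of squared residuals S for a candidate line y = m * x + b.
sum_sq_residuals <- function(m, b, x, y) {
  r <- y - (m * x + b)  # vertical distance from each point to the line
  sum(r^2)
}

x <- iris$Petal.Width
y <- iris$Petal.Length

# Closed-form least-squares estimates for one independent variable.
m_hat <- cov(x, y) / var(x)
b_hat <- mean(y) - m_hat * mean(x)

S_best  <- sum_sq_residuals(m_hat, b_hat, x, y)
S_other <- sum_sq_residuals(m_hat + 0.3, b_hat, x, y)  # a tilted line

# The least-squares line should give the smaller sum.
c(best = S_best, other = S_other)
```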

Regression in ggplot2

Add a regression line manually, by giving geom_abline an intercept and slope that you have chosen or calculated yourself, or automatically, by letting geom_smooth fit the line with least squares (method = "lm"). Both are shown below. (Note: the visual aspects of the code, such as color = "", theme_bw(), labs(), and linewidth, are purely aesthetic and not necessary.)

library(ggplot2)

ggplot(iris, aes(x = Petal.Width, y = Sepal.Length)) +
  labs(y = "Sepal Length", x = "Petal Width") +
  geom_point() +
  # Automatic: least-squares line fitted by lm()
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x,
              color = "cyan") +
  # Manual: line drawn from a hand-picked intercept and slope
  geom_abline(intercept = 5.5, slope = 0.5,
              color = "azure3", linewidth = 0.9) +
  theme_bw()

Running this code in R produces the graph shown on the next slide.

Here we see two lines, one with a slope set manually (grey) and one using an algorithm to determine the best slope (cyan). Can you tell which one has the smaller ‘Least Squares’ sum?
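One way to check the answer is to compute the sum for both lines. This is a small sketch using the same intercept and slope as the geom_abline call; the names S_manual and S_fitted are made up for this example.

```r
x <- iris$Petal.Width
y <- iris$Sepal.Length

# S for the manual line drawn by geom_abline (intercept 5.5, slope 0.5).
S_manual <- sum((y - (5.5 + 0.5 * x))^2)

# S for the line fitted by lm(), the same fit geom_smooth(method = "lm") draws.
fit <- lm(Sepal.Length ~ Petal.Width, data = iris)
S_fitted <- sum(residuals(fit)^2)

c(manual = S_manual, fitted = S_fitted)
```

Because lm() minimizes this sum by construction, the fitted line cannot come out larger than any manually chosen one.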

Regression in Plotly

To add a linear regression line in plotly, fit the model with the lm() function; its fitted values can then be used as a guide for the plotted line.

linearmodel = lm(Petal.Length ~ Petal.Width, data=iris)
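A possible way to draw that line, sketched under the assumption that the plotly package is installed: plot the points with plot_ly() and add the model's fitted values with add_lines(). The trace names "Observed" and "Linear fit" are illustrative labels only.

```r
linearmodel <- lm(Petal.Length ~ Petal.Width, data = iris)

# Skip the plot if plotly is not installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  library(plotly)

  # Scatter plot of the recorded points.
  fig <- plot_ly(iris, x = ~Petal.Width, y = ~Petal.Length,
                 type = "scatter", mode = "markers", name = "Observed")

  # fitted() returns one predicted length per row, which traces the line.
  fig <- add_lines(fig, x = ~Petal.Width, y = fitted(linearmodel),
                   name = "Linear fit")
  fig
}
```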

3D Linear Regression

Given several variables that each have a linear relationship with the same independent variable, a regression line drawn through three-dimensional space may be helpful in interpreting or extrapolating the data.
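One reading of this, and it is only an assumption about what the 3D case means here, is two dependent variables that share one independent variable. In R, lm() accepts several responses bound together with cbind(), and the fitted points then trace a straight line through 3D space. The names fit3d, widths and line3d are hypothetical.

```r
# Two dependent variables regressed on the same independent variable.
fit3d <- lm(cbind(Petal.Length, Sepal.Length) ~ Petal.Width, data = iris)

# One column of (intercept, slope) per dependent variable.
coef(fit3d)

# Points along the 3D regression line, at evenly spaced widths.
widths <- data.frame(Petal.Width = seq(0, 2.5, by = 0.5))
line3d <- cbind(widths, predict(fit3d, newdata = widths))
print(line3d)
```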

Downsides to Linear Regression

Of course, there are cases in which the linear regression model is not the most effective or accurate choice for a given data set. Some examples of data that do not suit it include:

  • Variables with non-linear relationships (exponential, logarithmic, etc.)

  • Variables with little/no relationship or correlation to each other.

  • Sets with nominal data (True/False, Gender, Species, etc.)

Linear regression may also struggle to perform accurately on sets with very little data, or in 3D analysis when the variables do not all share a linear relationship with the same independent variable.
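To illustrate the first case, the sketch below fits a straight line to made-up exponential data (generated here, not taken from any real data set) and compares it with a fit on the log-transformed values. R-squared is used as a rough indicator only, since the two fits are measured on different scales.

```r
set.seed(1)  # made-up data, for illustration only
x <- seq(1, 10, length.out = 50)
y <- exp(0.5 * x) * exp(rnorm(50, sd = 0.1))  # exponential trend with noise

fit_raw <- lm(y ~ x)       # straight line through curved data
fit_log <- lm(log(y) ~ x)  # straight line after a log transform

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared

c(raw = r2_raw, log = r2_log)
```

Plotting residuals(fit_raw) against x is another quick check: a clear curve in the residuals suggests the relationship is not linear.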

When to Use Linear Regression

Beyond its simplicity, linear regression is a powerful tool with many uses. It may be an effective model in cases such as:

  • Working with data sets whose variables are linearly related.

  • Filling holes within the set or extrapolating outside its boundaries.

  • Analyzing data that has been recorded over time.

  • Predicting values in between recorded plot points.

  • Working in conjunction with other, more complex analysis models.