2025-03-16

Introduction

  • We’ll be going over the concept of Simple Linear Regression
  • What it’s used for, and how it connects to the concept of Confidence Interval
  • How it can be applied to the Trees data set

Simple Linear Regression

Simple Linear Regression is a way to find a straight-line relationship that describes a possible correlation between an independent variable (x) and a dependent variable (y).

\(y = \beta_0 + \beta_1 x + \epsilon\)

Where: \(y\) is the dependent variable, \(x\) is the independent variable, \(\beta_0\) is the y-intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the term that accounts for error.
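As a minimal sketch, the intercept and slope of this model can be estimated in R with `lm()`. The example assumes the built-in `trees` data set, with Girth as \(x\) and Volume as \(y\) (the same pairing used in the plots later on). The names `fit`, `slope`, and `intercept` are illustrative only.

```r
# Fit Volume = beta_0 + beta_1 * Girth by ordinary least squares
fit <- lm(Volume ~ Girth, data = trees)
coef(fit)  # estimated intercept (beta_0) and slope (beta_1)

# The same estimates computed by hand from the least-squares formulas
x <- trees$Girth
y <- trees$Volume
slope <- cov(x, y) / var(x)             # beta_1 estimate
intercept <- mean(y) - slope * mean(x)  # beta_0 estimate
```

Both approaches should agree; the hand calculation is only there to show where the numbers returned by `lm()` come from.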

Confidence Interval (Pt.1)

A Confidence Interval is a range of values that is likely to contain an unknown quantity, such as the true mean of a data set. It is reported together with a confidence level, expressed as a percentage (for example, 95%). A higher confidence level typically means a wider interval.

To calculate a Confidence Interval, you’ll need to find a Z Value. A Z Value measures how many standard deviations a value lies from the mean, and is calculated by: \(Z = \frac{x - \mu}{\sigma}\)

Where: \(x\) is the individual data point, \(\mu\) is the mean of the data set, \(\sigma\) is the standard deviation.
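As a quick illustration, here is how the Z Value formula could be applied in R. This sketch assumes the Girth column of the built-in `trees` data set and uses the sample mean and sample standard deviation as stand-ins for \(\mu\) and \(\sigma\).

```r
x <- trees$Girth

# Z value of every data point: distance from the mean in standard deviations
z <- (x - mean(x)) / sd(x)   # sd() is the sample standard deviation

head(round(z, 2))   # first few Z values
```

Base R's `scale()` performs the same centering and scaling, so it can be used as a cross-check.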

Confidence Interval (Pt.2)

Based on this, we can use the following table to look up the Z-Value that corresponds to each (two-sided) confidence level:

Z-Scores to Confidence Interval
Z-Score   Confidence Level
1.282     80%
1.645     90%
1.960     95%
2.326     98%
2.576     99%
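The Z-Scores in this table are two-sided critical values of the standard normal distribution, so they can be reproduced (up to rounding) with `qnorm()`. The sketch below also shows one hypothetical use: a Z-based interval for the mean Girth of the built-in `trees` data set. This relies on a normal approximation, and the variable names are illustrative.

```r
conf_level <- c(0.80, 0.90, 0.95, 0.98, 0.99)

# Two-sided critical value: leave (1 - level) / 2 in each tail
z_crit <- qnorm(1 - (1 - conf_level) / 2)
data.frame(conf_level, z_crit = round(z_crit, 3))

# Hypothetical use: 95% interval for mean Girth (normal approximation)
x <- trees$Girth
se <- sd(x) / sqrt(length(x))                  # standard error of the mean
ci_95 <- mean(x) + c(-1, 1) * z_crit[3] * se   # lower and upper bound
ci_95
```

With only 31 trees, a t-based interval (using `qt()` in place of `qnorm()`) would be slightly wider; the Z version is used here to match the table.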

Trees Plot (Plotly)

Here’s a plot of the Trees data set describing the correlation of Girth and Volume. The orange line you see describes the Linear Regression of the two variables. As you can see, there’s a linear, positive correlation between Girth and Volume of a tree.
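The code behind this plot is not shown on the slide, so the following is only a sketch of how a similar figure could be built with the `plotly` package. The trace names, the `Fitted` column, and the exact styling are assumptions.

```r
library(plotly)

# Fit the regression and store the fitted values next to the data
fit <- lm(Volume ~ Girth, data = trees)
df <- trees[order(trees$Girth), ]
df$Fitted <- predict(fit, newdata = df)

# Scatter plot of the raw data
p <- plot_ly(df, x = ~Girth, y = ~Volume,
             type = "scatter", mode = "markers", name = "Trees")

# Overlay the regression line in orange
p <- add_lines(p, y = ~Fitted, name = "Linear Regression",
               line = list(color = "orange"))
p
```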

Trees Plot (GGPlot)

Let’s describe the correlation between Girth and Volume of Trees using GGPlot:
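A minimal sketch of such a plot, assuming the `ggplot2` package and the built-in `trees` data set. Passing `formula = y ~ x` explicitly is optional; it only avoids the informational message that `geom_smooth()` prints otherwise.

```r
library(ggplot2)

# Scatter plot with a fitted regression line
# (the default confidence band is drawn at level = 0.95)
p <- ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p
```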

Trees Plot (Confidence Interval Pt.1)

The gray area that you see on the graph represents the 95% Confidence Interval around the regression line, meaning we are 95% confident that the true average Volume for a given Girth falls within that band.
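The band can also be inspected numerically. As a sketch, `predict()` with `interval = "confidence"` returns the lower and upper edge of the interval at chosen x values. The girths below are hypothetical inputs picked for illustration. Note that for a linear model this interval is based on the t distribution rather than the Z table, which matters a little for a small data set.

```r
fit <- lm(Volume ~ Girth, data = trees)

# Hypothetical girths at which to evaluate the band
new_girth <- data.frame(Girth = c(10, 13, 16))

# 95% confidence interval for the mean Volume at each girth
band_95 <- predict(fit, newdata = new_girth,
                   interval = "confidence", level = 0.95)
cbind(new_girth, band_95)   # columns: Girth, fit, lwr, upr
```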


Trees Plot (Confidence Interval Pt.2)

If we increase the level to 99%, we see that the gray band grows wider: being more confident requires a wider interval.

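library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()
g <- ggplot(data = trees, aes(x = Girth, y = Volume)) + geom_point()
# Add the fitted regression line with a 99% confidence band
g + geom_smooth(method = 'lm', level = .99) + ylim(0, 80)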
  `geom_smooth()` using formula = 'y ~ x'
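To put a number on how much the band widens, the two levels can be compared directly. This is a sketch using `predict()` on the built-in `trees` data set; the girths are hypothetical inputs.

```r
fit <- lm(Volume ~ Girth, data = trees)
new_girth <- data.frame(Girth = c(10, 13, 16))   # hypothetical girths

band_95 <- predict(fit, new_girth, interval = "confidence", level = 0.95)
band_99 <- predict(fit, new_girth, interval = "confidence", level = 0.99)

# Width of the band at each girth
width_95 <- band_95[, "upr"] - band_95[, "lwr"]
width_99 <- band_99[, "upr"] - band_99[, "lwr"]
width_99 / width_95   # same ratio at every girth

# The ratio matches the t critical values for the residual degrees of freedom
df_resid <- df.residual(fit)
qt(0.995, df_resid) / qt(0.975, df_resid)
```

The fitted line itself does not move; only the critical value changes, which is why the band widens by the same factor everywhere.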

Conclusion

  • Simple Linear Regression can be used to find possible linear correlations between two variables
  • A Confidence Interval can be used to estimate a range of likely values at a certain point on a graph, with a confidence level that corresponds to a Z-Value