2023-04-10

What is Simple Linear Regression

Simple linear regression is a technique used in many applications to estimate the relationship between two quantitative variables. It is typically used when you want to:

  • Describe how strongly two variables are related (their correlation)
  • Estimate the value of a dependent variable at a certain value of an independent variable

The equation for linear regression is:

\[ y = a + bx \]

In the equation:

  • y is the predicted value of the dependent variable
  • a is the estimated intercept
  • b is the estimated slope or regression coefficient
  • x is the value of the explanatory variable

The next slides go over how to find the coefficients a and b.

How to find variable: a

As stated in the previous slide, a is the estimated y-intercept of the fitted line. It is an estimate because it is calculated from a sample of data: no matter how carefully the data are collected, you will never know the true value of the intercept.

To find a, you will use the formula:

\[ a = \bar{y} - b\bar{x} \]

The y with a bar over it is the mean of the observed values of the dependent variable, and the x with a bar over it is the mean of the observed values of the explanatory variable. The b is the same as before: the estimated slope or regression coefficient.

How to find variable: b

As stated in the first slide, b is the slope, or regression coefficient, that x is multiplied by. It is how much we expect y to change when x increases by one unit. The value b is used both in the simple linear regression equation and in the equation for a.

To find b, you will use a few formulas:

\[ b = \frac{s_{xy}}{s_{xx}} \]

\[ s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

\[ s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

There are several symbols in these equations. As in the equation for a, the x and y bar variables stand for the means of the x (explanatory) and y (dependent) variables. The x_i and y_i variables stand for the individual observations (x_1, y_1), (x_2, y_2), …, (x_n, y_n), where n is the number of observations.
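The formulas for b and a can be checked by computing them by hand and comparing the result with R's built-in lm() function. The sketch below assumes Height as the explanatory variable and Volume as the dependent variable from the built-in trees dataset used in the later slides; the variable names (x_bar, s_xx, and so on) are only illustrative.

```r
# Assumption: Height is x (explanatory) and Volume is y (dependent)
x <- trees$Height
y <- trees$Volume

# Means of x and y (the "bar" variables)
x_bar <- mean(x)
y_bar <- mean(y)

# Sums of squares and cross-products from the formulas above
s_xx <- sum((x - x_bar)^2)
s_xy <- sum((x - x_bar) * (y - y_bar))

# Slope first, then intercept (a depends on b)
b <- s_xy / s_xx
a <- y_bar - b * x_bar

# Compare with the coefficients that lm() estimates
fit <- lm(Volume ~ Height, data = trees)
cat("by hand: a =", a, " b =", b, "\n")
print(coef(fit))
```

The two sets of coefficients should agree up to floating-point rounding, since lm() fits the same least-squares line.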

Importing Data - Trees Dataset

For this example I will be using the trees dataset that comes preinstalled with R. I will use linear regression to see whether there is a relationship between its variables.

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

This dataset contains measurements of 31 felled cherry trees: the girth (diameter) of the tree in inches, measured 4 ft 6 in above the ground; the height of the tree in feet; and the volume of the tree in cubic feet.
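The preview above has the shape of what head() prints for this dataset. A minimal sketch for loading and inspecting it, assuming a standard R installation:

```r
# trees ships with base R (package "datasets"), so nothing needs installing
data(trees)

# First six rows, as in the preview above
head(trees)

# Structure: number of observations and the type of each column
str(trees)
```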

Graphing the Data
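A scatter plot of height against volume can be drawn with a sketch like the one below. It assumes the ggplot2 package is installed; the axis labels and the object name scatter are my own choices rather than the original plot's.

```r
library(ggplot2)

# Height on the x-axis (explanatory), Volume on the y-axis (dependent)
scatter <- ggplot(trees, aes(x = Height, y = Volume)) +
  geom_point() +
  labs(x = "Height (ft)", y = "Volume (cubic ft)")

print(scatter)
```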

Discussion of Graphed Data

Looking at the graph in the previous slide, there seems to be a slight positive correlation between height and volume: taller trees tend to have larger volumes. There are some outliers in the graph, though. For example, the point at approximately 83 ft has a very small volume of about 20 cubic feet. In contrast, another point at approximately 82 ft has a volume of approximately 57 cubic feet, which fits the idea that there is a positive correlation between height and volume.

Graph With Line of Best Fit

## `geom_smooth()` using formula = 'y ~ x'
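The message above is what ggplot2 prints when geom_smooth() is called without an explicit formula. A minimal sketch of a scatter plot with a least-squares line, assuming ggplot2 is installed; se = FALSE (which hides the confidence band) and the axis labels are my assumptions, not necessarily the original settings.

```r
library(ggplot2)

# Same scatter plot as before, with a fitted straight line added
fit_plot <- ggplot(trees, aes(x = Height, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # method = "lm" draws the least-squares line
  labs(x = "Height (ft)", y = "Volume (cubic ft)")

print(fit_plot)
```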

Discussion of Line of Best Fit

Looking at the graph with the line of best fit, the line does not fit the data closely: the points are scattered widely around it. One factor that could be affecting this is the small size of the dataset. With only 31 points, it is harder to see a relationship between the variables, and more data might make the trend more apparent. However, even though most of the points do not fall on the line, they do follow its upward trend, which suggests there is a positive correlation between the variables.
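One way to put a number on how closely the points follow the line is the correlation coefficient r and the R-squared value of the fitted model. This is a sketch of how those could be computed, not a calculation taken from the original slides; the names fit, r, and r_squared are illustrative.

```r
# Fit the simple linear regression of Volume on Height
fit <- lm(Volume ~ Height, data = trees)

# Pearson correlation between the two variables
r <- cor(trees$Height, trees$Volume)

# Proportion of the variation in Volume explained by the line
r_squared <- summary(fit)$r.squared

cat("r =", r, "\n")
cat("R-squared =", r_squared, "\n")
```

In simple linear regression, R-squared equals the square of r, so either value describes how tightly the points cluster around the line.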

Taking it a Step Further

Regression can also be used to look at the relationship between more than two variables. We won’t go into depth on this topic, but to start the discussion, I will plot a 3D scatter plot with plotly of all three variables (Height, Girth and Volume) to see whether they are related. You can apply regression techniques here as well, such as fitting a line of best fit or a plane, but we won’t be going over that in this presentation.

Plotly Plot
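A 3D scatter plot of the three variables can be built with a sketch like the one below. It assumes the plotly package is installed; which variable goes on which axis, the marker size, and the axis titles are my assumptions.

```r
library(plotly)

# One marker per tree, positioned by girth, height, and volume
fig <- plot_ly(
  data = trees,
  x = ~Girth, y = ~Height, z = ~Volume,
  type = "scatter3d", mode = "markers",
  marker = list(size = 4)
)

# Label the three axes with their units
fig <- layout(fig, scene = list(
  xaxis = list(title = "Girth (in)"),
  yaxis = list(title = "Height (ft)"),
  zaxis = list(title = "Volume (cubic ft)")
))

fig
```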

Looking at the plotly plot, there is very clearly a positive relationship among the three variables: as height and girth increase, the volume also increases.