2024-03-19

What is Linear Regression?

  • Statistical model that represents/estimates the relationship between a dependent variable and one or more independent variables
  • Can predict an unknown value of the dependent variable from known values of the independent variable(s)

How does this relate to Data Science?

We can combine statistics and code to fit a linear regression and predict useful information from what we already know.

Examples:

  • using rainfall to predict soil erosion
  • using age to predict income
  • using height to predict weight

The General Math

Linear Regression:

\(Y = a + bx\)

Y is the dependent variable, x is the independent variable, b is the estimated slope, and a is the estimated intercept.

\(a = \frac{(\sum y)(\sum x^{2}) - (\sum x)(\sum xy)}{n(\sum x^{2}) - (\sum x)^{2}}\)

\(b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^{2}) - (\sum x)^{2}}\)

Once a and b are estimated, we can use the equation Y = a + bx to predict Y for any value of x.
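
As a quick check of the formulas for a and b, the following R sketch computes both directly from the sums and compares them with lm(). The height and weight values are made up for illustration and are not the data behind the plots in these notes.

```r
# Hypothetical sample data (illustrative values only)
x <- c(150, 160, 165, 170, 175, 180, 185)   # height in cm
y <- c(52, 58, 63, 66, 70, 77, 80)          # weight in kg
n <- length(x)

# Denominator shared by both formulas: n * sum(x^2) - (sum(x))^2
denom <- n * sum(x^2) - sum(x)^2

# Slope and intercept from the closed-form least-squares formulas
b <- (n * sum(x * y) - sum(x) * sum(y)) / denom
a <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / denom

# Compare with R's built-in fit; the two should agree up to rounding
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

If the hand-computed values and coef(fit) match, the formulas and lm() are describing the same line.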

Scatter Plots

Let's look at a visual representation of linear regression:

[Figure: scatter plot of weight (kg) against height (cm), titled "Height vs Weight", with a red linear regression line]

This is a scatter plot, so you can see every data point. The red line is the fitted regression line, which shows a positive linear relationship. The line also lets us predict values where we have no data points. Plotting the data and the regression line together on a scatter plot is a simple way to see how well the line fits.

The Code

Now let's break down the code so you understand how this graph is made.

library(ggplot2)

# we are using ggplot to make a scatter plot
# `data` is assumed to be a data frame with height and weight columns;
# height is mapped to x and weight to y
ggplot(data, aes(height, weight)) +

  # geom_point() adds the points to the graph
  geom_point() +

  # geom_smooth(method = 'lm') adds the linear regression line
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +

  # the rest of the code formats how we want our graph to look
  theme_minimal() +
  labs(x = 'Height (cm)', y = 'Weight (kg)', title = 'Height vs Weight') +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = 'bold'))
## `geom_smooth()` using formula = 'y ~ x'

Code to get the Linear Regression model

Now, you might be wondering “how did we get the red line?”

We used the code

model <- lm(weight ~ height, data = data)
coefficients <- coef(model)
intercept <- coefficients[1]
slope <- coefficients[2]

The lm() function fits the model, with “weight” (Y) as the dependent variable and “height” (X) as the independent variable. coef() then returns the fitted coefficients as a vector: the first element is the intercept and the second is the slope. This is how we get a linear regression model using code.

The Math Behind the Code

Now, to relate that code to the equations from before:

\(Y\) is the dependent variable, weight

\(x\) is the independent variable, height

\(a\) is the intercept, the first value in the coefficient vector: the predicted weight when height is zero

\(b\) is the slope, the second value in the coefficient vector: the change in weight for a one-unit change in height

This gives us all the pieces to create the equation:

weight = intercept + slope*height

\(Y = a + bx\)

The code does the math for you!
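
To make the equation concrete, here is a minimal sketch that plugs a new height into weight = intercept + slope*height by hand and checks the result against predict(). The data frame `df` and the height of 172 cm are hypothetical values chosen for illustration.

```r
# Hypothetical data frame standing in for the height/weight data
df <- data.frame(
  height = c(150, 160, 165, 170, 175, 180, 185),
  weight = c(52, 58, 63, 66, 70, 77, 80)
)

model <- lm(weight ~ height, data = df)
intercept <- unname(coef(model)[1])
slope <- unname(coef(model)[2])

# Predict weight for a height that is not in the data
new_height <- 172
manual <- intercept + slope * new_height

# predict() evaluates the same equation from the fitted model
builtin <- unname(predict(model, newdata = data.frame(height = new_height)))

print(c(manual = manual, builtin = builtin))
```

Both numbers should be the same, since predict() applies the same intercept and slope.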

Negative linear regression?

To give you another example, here is a graph that looks a little different.

[Figure: scatter plot of reaction time against age with a downward-sloping linear regression line]

This is an example of a data set with a negative regression line. That means as the independent variable (age) increases, the dependent variable (reaction time) decreases.
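
One way to confirm a negative relationship in code is to check the sign of the fitted slope. The sketch below uses simulated data (an assumption for illustration, not the data set in the graph) in which the response falls as the predictor rises.

```r
# Simulated data (illustrative only): the response drops as x grows
set.seed(1)
x <- 1:30
y <- 100 - 1.5 * x + rnorm(30, sd = 3)

fit <- lm(y ~ x)
slope <- unname(coef(fit)[2])

# A negative slope means the fitted line falls from left to right
print(slope)
print(slope < 0)
```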

Plotly Graph

To let you explore the data along with the regression line, here is an interactive Plotly graph. The blue points are the actual data we have, while the orange points are predicted values that lie on the red linear regression line.
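
A graph like this could be built roughly as follows. This is only a sketch: the code and data behind the original interactive graph are not shown in these notes, so the data frame `df`, its values, and the trace names here are hypothetical.

```r
library(plotly)

# Hypothetical data standing in for the real data set
df <- data.frame(
  height = c(150, 160, 165, 170, 175, 180, 185),
  weight = c(52, 58, 63, 66, 70, 77, 80)
)

# Fit the model and store the predicted weight for each observed height
model <- lm(weight ~ height, data = df)
df$predicted <- fitted(model)

fig <- plot_ly(df, x = ~height) %>%
  # blue points: the actual data
  add_markers(y = ~weight, name = "Actual", marker = list(color = "blue")) %>%
  # red line: the fitted regression line
  add_lines(y = ~predicted, name = "Regression line", line = list(color = "red")) %>%
  # orange points: predicted values lying on the line
  add_markers(y = ~predicted, name = "Predicted", marker = list(color = "orange")) %>%
  layout(xaxis = list(title = "Height (cm)"), yaxis = list(title = "Weight (kg)"))

fig
```

Hovering over a point in the rendered graph shows its x and y values, which makes it easy to compare an actual value with the prediction at the same height.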