Simple Linear Regression

What is Simple Linear Regression?

Simple linear regression is used to model the linear relationship between two variables.

There is an independent, explanatory variable (x), and a dependent variable (y).

Here is a plot from the R dataset iris, where Petal.Length is used as the explanatory variable for Petal.Width.

Correlation

The plot in the previous slide showed a positive correlation, which means that as x increases, y tends to increase as well.
An example of a graph with negative correlation is on the left and one with no correlation is on the right.

Remember: Correlation does not equal Causation!
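As a rough illustration, the strength and direction of the correlation in the iris example can be computed with R's built-in cor() function. The variable names r and r_neg are placeholders chosen for this sketch, and negating one variable is only a quick way to show a negative correlation.

```r
# Pearson correlation between petal length and petal width in iris
r <- cor(iris$Petal.Length, iris$Petal.Width)
r   # close to +1, i.e. a strong positive linear association

# Flipping the sign of one variable flips the sign of the correlation,
# which gives an example of a negative correlation of the same strength
r_neg <- cor(iris$Petal.Length, -iris$Petal.Width)
r_neg
```

A value near 0 would indicate little or no linear association.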

Regression Plot

We can make a simple linear regression model that estimates the Petal Width of an Iris using the Petal Length.

Check out the Interactive Plot!
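For readers without access to the interactive version, a static plot along the same lines can be sketched in base R. The axis labels, title, and line styling below are choices made for this example, not necessarily those of the original figure.

```r
# Scatter plot of the iris petal measurements (recorded in centimeters)
plot(Petal.Width ~ Petal.Length, data = iris,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)",
     main = "Petal Width vs. Petal Length")

# Fit the simple linear regression and overlay the fitted line
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
abline(fit, col = "orange", lwd = 2)
```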

Regression Equation

Our simple linear regression model for the given data is shown by this equation: \[\text{Petal Width} = \beta_0 + \beta_1\cdot \text{(Petal Length)} + \varepsilon, \hspace{1cm} \varepsilon \sim \mathcal{N} (0, \sigma^2)\] where \({\varepsilon}\) is the error term.

Our estimated model (the orange line on the graph) is represented by this equation: \[\widehat{\text{Petal Width}} = \hat{\beta}_0 + \hat{\beta}_1\cdot \text{(Petal Length)}\] The hat on Petal Width marks it as a predicted value rather than an observed one.

\({\hat{\beta}_0}\) is the y-intercept of the orange line.

\({\hat{\beta}_1}\) is the slope of the orange line.

Least Squares Regression Line

The way to find the best fitting line for a simple linear regression is by using the least squares method. This can be done by following these steps:

- Step 1: For each \({(x,y)}\) point, calculate \({x^2}\) and \({xy}\)

- Step 2: Sum all \({x}\), \({y}\), \({xy}\), and \({x^2}\), which gives us \({\Sigma x}\), \({\Sigma y}\), \({\Sigma xy}\), and \({\Sigma x^2}\).

- Step 3: Find \({\hat{\beta}_1}\) where \({n}\) is the number of data points:

\[\hat{\beta}_1 = {n\Sigma(xy) - \Sigma x\Sigma y\over n \Sigma(x^2)-(\Sigma x)^2}\]

- Step 4: Find \({\hat{\beta}_0}\):

\[\hat{\beta}_0 = {\Sigma y - \hat{\beta}_1\Sigma x\over n }\]
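The four steps above can be carried out by hand in R. This is a minimal sketch that uses the iris petal measurements as x and y. The names sum_x, sum_xy, beta1_hat, and so on are chosen here for illustration only.

```r
x <- iris$Petal.Length   # explanatory variable
y <- iris$Petal.Width    # dependent variable
n <- length(x)           # number of data points

# Steps 1 and 2: compute x^2 and xy for each point, then sum everything
sum_x  <- sum(x)
sum_y  <- sum(y)
sum_xy <- sum(x * y)
sum_x2 <- sum(x^2)

# Step 3: slope
beta1_hat <- (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x^2)

# Step 4: intercept
beta0_hat <- (sum_y - beta1_hat * sum_x) / n

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against R's built-in fitting function
coef(lm(y ~ x))
```

Both approaches should give the same intercept and slope up to floating-point rounding.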

Why Least Squares?

The goal of the regression line is to make the errors between the actual data and the estimated model as small as possible, while also balancing them so that the errors above the line and the errors below the line sum to the same amount.

A horizontal line at the mean of y would satisfy the second condition, but not the first.

A line drawn through as many points as possible might keep many individual errors small, but it could leave a disproportionate number of points on one side of the line, meaning the second condition would not be satisfied.

The best way to meet both conditions is to use the least squares method, which calculates the differences between the actual values and the predicted values, squares them, and finds the line that minimizes the sum of those squared differences.
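Both conditions can be checked numerically on the iris example. The sketch below compares the least squares line with a horizontal line at the mean of y. The names res_fit, res_mean, sse_fit, and sse_mean are illustrative.

```r
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Errors (residuals) for the least squares line and for the mean-of-y line
res_fit  <- residuals(fit)
res_mean <- iris$Petal.Width - mean(iris$Petal.Width)

# Balance condition: both sets of errors sum to (numerically) zero
sum(res_fit)
sum(res_mean)

# Size condition: the least squares line has a much smaller
# sum of squared errors than the mean-of-y line
sse_fit  <- sum(res_fit^2)
sse_mean <- sum(res_mean^2)
c(least_squares = sse_fit, mean_line = sse_mean)
```

The mean line balances its errors but leaves them large, while the least squares line balances them and keeps the squared total small.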

Running a Regression in R

To run a simple regression in R, use these commands, substituting the variables and dataset of your choice:

  model1 <- lm(dependentvar ~ independentvar, data=dataset)
  summary(model1)

Running a Regression in R (cont.)

Example using Petal Width by Petal Length:

  mod <- lm(Petal.Width ~ Petal.Length, data=iris)
  summary(mod)

Call:
lm(formula = Petal.Width ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56515 -0.12358 -0.01898  0.13288  0.64272 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared:  0.9271,    Adjusted R-squared:  0.9266 
F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Interpreting a Regression from R

R prints out a lot of information you may or may not be looking for. Right now, we are just looking for \({\hat{\beta}_0}\) and \({\hat{\beta}_1}\). Remembering back to the estimated model: \[\widehat{\text{Petal Width}} = \hat{\beta}_0 + \hat{\beta}_1\cdot \text{(Petal Length)}\] we can see that we are looking for the coefficients of the intercept and Petal Length. Those appear in the Estimate column of the summary (shown on the previous slide), but you can also extract them with the code shown below.

  coefficients(mod)

 (Intercept) Petal.Length 
  -0.3630755    0.4157554

Those two numbers are the \({\hat{\beta}_0}\) and \({\hat{\beta}_1}\) for our estimated model.
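With the coefficients in hand, the estimated model can be used to predict a petal width. The sketch below uses a hypothetical petal length of 4 cm (a value picked only for illustration) and computes the prediction both by hand and with R's predict() function.

```r
mod <- lm(Petal.Width ~ Petal.Length, data = iris)
b <- coefficients(mod)

# Hypothetical flower with a petal length of 4 cm
new_length <- 4

# Prediction by hand: intercept + slope * petal length
by_hand <- unname(b["(Intercept)"] + b["Petal.Length"] * new_length)

# The same prediction using predict() with a one-row data frame
by_predict <- unname(predict(mod, newdata = data.frame(Petal.Length = new_length)))

c(by_hand = by_hand, by_predict = by_predict)
```

Both values should agree, at roughly 1.3 cm. The slope also has a direct reading: each additional centimeter of petal length is associated with an estimated increase of about 0.42 cm in petal width.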

Wrap-Up

There is a lot more to linear regression than fits in this simple overview. Significance, hypothesis testing, regressions with multiple independent variables, and more are out there to explore, but this should give you a basic idea of how to get started with simple linear regression!