By: McKenzie Hebert

What is Linear Regression?

Linear regression is a statistical tool for predicting the value of one variable from another. The simplest linear regression models use the “least squares” method, which picks the line of best fit by minimizing the sum of squared vertical distances between the data points and the line.

Linear regression is important because it quantifies how two variables are related, turning raw data into estimates that people can use to make informed decisions.

An example dataset

Our example dataset is R's built-in “mtcars”, which records attributes of 32 cars, including fuel efficiency (mpg) and weight (wt, in units of 1,000 lbs).

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A sample plot

This is our sample plot, with mpg on the x axis and weight on the y axis. The relationship between them is negative: as weight goes up, mpg goes down.
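The code behind this plot is not shown on the slide. A minimal sketch of how it could be built, assuming the ggplot2 package and assuming the plot object is stored as g (the name used on the later regression slide); the axis labels are illustrative:

```r
library(ggplot2)

data(mtcars)

# Scatter plot with mpg on the x axis and weight on the y axis.
# The name `g` is an assumption, chosen to match the later slide.
g <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(x = "Miles per gallon (mpg)", y = "Weight (1,000 lbs)")

g
```

Storing the plot in a variable lets later slides add layers to it without repeating the setup.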

Plot with Linear Regression

The next slide shows the code for the same plot with a linear regression line and a confidence band added. Setting method = “lm” overlays the linear model on the graph. The line is the line of best fit for the data. The grey area is the default 95% confidence interval for the fitted line: it shows the uncertainty in the estimated average weight at each mpg value. It is not the range in which new data points are expected to fall; that would be a prediction interval, which is wider.

Linear Regression Plot

# g is assumed to hold the mpg vs. weight scatter plot built earlier
g + geom_smooth(method = "lm", se = TRUE) + coord_cartesian(ylim = c(0, 6))
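The grey band drawn by geom_smooth can also be examined numerically. The sketch below fits the same kind of model with lm() and uses predict() to compare a 95% confidence interval for the fitted line with a 95% prediction interval for new cars. The mpg values in new_cars are made up for illustration only.

```r
data(mtcars)

# Same model the smoother fits: weight (y) explained by mpg (x)
fit <- lm(wt ~ mpg, data = mtcars)

# Hypothetical mpg values, chosen only for illustration
new_cars <- data.frame(mpg = c(15, 20, 25))

# Uncertainty in the fitted (average) weight at each mpg
conf_band <- predict(fit, newdata = new_cars,
                     interval = "confidence", level = 0.95)

# Range expected to contain the weight of an individual new car
pred_band <- predict(fit, newdata = new_cars,
                     interval = "prediction", level = 0.95)

conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]

print(cbind(mpg = new_cars$mpg, conf_width, pred_width))
```

The prediction interval is wider at every mpg value because it accounts for the scatter of individual cars around the line as well as the uncertainty in the line itself.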

How to calculate linear regression

The formula below represents a simple linear regression model, where y is the dependent variable, x is the independent variable, β0 is the intercept, β1 is the slope coefficient, and ε is the error term.

\[ y = \beta_0 + \beta_1 x + \epsilon \]
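As a rough illustration, the least squares estimates of β0 and β1 can be computed by hand and checked against lm(). This sketch assumes weight is the dependent variable and mpg the independent variable, matching the axes of the sample plot; the variable names are illustrative.

```r
data(mtcars)

x <- mtcars$mpg  # independent variable
y <- mtcars$wt   # dependent variable

# Closed-form least squares estimates for simple linear regression
slope <- cov(x, y) / var(x)             # estimate of beta_1
intercept <- mean(y) - slope * mean(x)  # estimate of beta_0

# The same estimates from R's built-in linear model function
fit <- lm(wt ~ mpg, data = mtcars)

by_hand <- c(intercept, slope)
from_lm <- unname(coef(fit))

print(by_hand)
print(from_lm)
```

The two sets of numbers should agree up to floating-point rounding, and the slope is negative, in line with the downward trend in the plot.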

Coefficient of Determination (R^2)

R^2 measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]

Here \( y_i \) are the observed values of y, \( \hat{y}_i \) are the predicted values of y from the regression model, \( \bar{y} \) is the mean of the observed y values, and n is the number of data points.
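The R^2 formula can be applied directly in a few lines. This sketch again assumes the model of weight on mpg and compares the hand-computed value with the one reported by summary().

```r
data(mtcars)

fit <- lm(wt ~ mpg, data = mtcars)

y <- mtcars$wt       # observed values y_i
y_hat <- fitted(fit) # predicted values from the model

ss_res <- sum((y - y_hat)^2)    # residual sum of squares (numerator)
ss_tot <- sum((y - mean(y))^2)  # total sum of squares (denominator)

r_squared <- 1 - ss_res / ss_tot

print(r_squared)
print(summary(fit)$r.squared)
```

Both values come out at about 0.75, meaning roughly three quarters of the variance in weight is accounted for by mpg in this model.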

Linear Regression with R^2

We can also plot a linear regression with the plotly package. This scatter plot shows Weight vs. MPG and displays the R^2 value on the plot. The plot is interactive, so feel free to hover over the data!
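The code for the interactive plot is not included in the slides. One possible way to build something similar is sketched below, assuming the plotly package is installed. The trace names, the annotation position, and the fitted_wt column are illustrative choices, not the original code.

```r
library(plotly)

data(mtcars)

fit <- lm(wt ~ mpg, data = mtcars)
r2 <- summary(fit)$r.squared

# Copy of the data with the fitted values added (illustrative column name)
cars <- mtcars
cars$fitted_wt <- unname(fitted(fit))

p <- plot_ly(cars, x = ~mpg, y = ~wt,
             type = "scatter", mode = "markers", name = "Cars") %>%
  add_lines(y = ~fitted_wt, name = "Linear fit") %>%
  layout(
    title = "Weight vs. MPG",
    xaxis = list(title = "MPG"),
    yaxis = list(title = "Weight (1,000 lbs)"),
    # Place the R^2 label near the top right corner of the plotting area
    annotations = list(
      x = 0.95, y = 0.95, xref = "paper", yref = "paper",
      text = paste0("R^2 = ", round(r2, 3)),
      showarrow = FALSE
    )
  )

p
```

Hovering over a point shows its mpg and weight values.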

The End!

Thanks for watching my presentation! :)