3/27/2020

Simple Linear Regression

What is Simple Linear Regression?

Simple linear regression is a linear model of the relationship between two variables: one dependent variable and one independent variable. The best-fit line is the one that minimizes the sum of squared residuals. This model is extremely useful: it can help us determine whether the two variables are linearly correlated and whether that correlation is statistically significant.

How it works

The model function for a simple linear regression is \(y = \beta_0 + \beta_1 x + \varepsilon\). The goal is to minimize the sum \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\). In this sum \(y_i\) is the observed value of the dependent variable and \(\hat{y}_i\) is the estimated value of the dependent variable. Here \(\hat{y} = b_0 + b_1x\), where \(b_0\) and \(b_1\) are the estimates of \(\beta_0\) and \(\beta_1\). We calculate \(b_1\) by the expression \(\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\). Here \(\bar{x}\) and \(\bar{y}\) are the means of the independent and dependent variables.
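As a minimal sketch of the slope formula, the following computes \(b_1\) by hand on the built-in mtcars data, taking wt as \(x\) and mpg as \(y\) (the same variables used in the example later in these slides), and compares it with the slope reported by lm(). The variable names are chosen here for illustration only.

```r
# Slope estimate b1 computed directly from the formula,
# using the built-in mtcars data (x = wt, y = mpg).
x <- mtcars$wt
y <- mtcars$mpg

x_bar <- mean(x)
y_bar <- mean(y)

# Numerator: sum of cross-deviations; denominator: sum of squared x-deviations
b1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Compare with the slope that lm() reports for the same model
fit <- lm(mpg ~ wt, data = mtcars)
b1_lm <- unname(coef(fit)["wt"])

print(c(by_formula = b1, from_lm = b1_lm))
```

The two values should agree up to floating-point rounding.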

How it works (cont’d)

The value of \(b_0\) is found after we find \(b_1\). We find \(b_0\) by using the point \((\bar{x},\bar{y})\), which is called the center of mass of the data that we are performing the regression on. The regression line always passes through this point. For example, say \(\bar{x} = 3, \bar{y} = 2\) and \(b_1 = 5\). Then we have \(2 = b_0 + (5)(3)\), and solving for \(b_0\) gives \(b_0 = -13\). By doing this we find the regression line for the given data set.
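The intercept step can be sketched in R as well. The first part reproduces the small worked example; the second part applies the same idea to mtcars (wt as \(x\), mpg as \(y\), an assumption made here to match the later example) and checks it against lm().

```r
# Worked example: x_bar = 3, y_bar = 2, b1 = 5
x_bar <- 3
y_bar <- 2
b1 <- 5
b0 <- y_bar - b1 * x_bar   # 2 - (5)(3) = -13
print(b0)

# Same calculation on mtcars (x = wt, y = mpg)
x <- mtcars$wt
y <- mtcars$mpg
b1_cars <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_cars <- mean(y) - b1_cars * mean(x)

# The fitted line passes through (mean(x), mean(y))
fit <- lm(mpg ~ wt, data = mtcars)
at_mean <- unname(predict(fit, newdata = data.frame(wt = mean(x))))
print(c(b0_by_hand = b0_cars, b0_from_lm = unname(coef(fit)[1])))
print(c(prediction_at_mean_x = at_mean, mean_y = mean(y)))
```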

Simple Linear Regression Example

Here is an example using the mtcars dataset, with mpg plotted against weight in a scatter plot. The red line is the simple linear regression of mpg on weight.

R code

This is the R code for the previous slide.

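library(plotly)  # provides plot_ly(), add_markers(), add_lines() and %>%

# Fit mpg as a linear function of weight
simp <- lm(mpg ~ wt, data = mtcars)

# Scatter plot of the data with the fitted regression line drawn on top
slr <- mtcars %>% 
  plot_ly(x = ~wt) %>% 
  add_markers(y = ~mpg, name = "Original Data") %>% 
  add_lines(x = ~wt, y = fitted(simp), name = "Regression")
slr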

Goodness of fit

How do we determine whether a linear regression has given us a good model of the data? We do this by comparing it to a simpler model of the data, one in which only the dependent variable is plotted in a scatter plot along with the mean of its values. We measure the distance between each value and the mean, square these distances, and add them all up. The result is called the SST, the Sum of Squares Total.

There are also two other values that are of importance: the SSE and the SSR. The SSE, the Sum of Squared Errors, is computed like the SST, except that we square the difference between each data point and the estimated value given by the regression line, and then sum these squares. The SSR, the Sum of Squares due to Regression, is how much of the total variation is accounted for by the regression. The SSR is calculated as \(SSR = SST - SSE\).
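A short sketch of these three quantities in R, assuming the mpg-on-weight model from the mtcars example. For a least-squares fit with an intercept, the SSR obtained by subtraction also equals the sum of squared differences between the fitted values and the mean, which the last lines check.

```r
# Sums of squares for the regression of mpg on wt (mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)   # total: distances from the mean
sse <- sum((y - y_hat)^2)     # error: distances from the regression line
ssr <- sst - sse              # regression: what is left over

# Equivalent direct computation of SSR (holds for least squares with an intercept)
ssr_direct <- sum((y_hat - mean(y))^2)

print(c(SST = sst, SSE = sse, SSR = ssr, SSR_direct = ssr_direct))
```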

Simpler Model of Data

In this graph we see only the values of the dependent variable plotted along with their mean. It is this model that the regression is compared against to determine how good a model it is. The SST in this example is the sum of the squared differences between each data point and the red dashed line.

Linear Regression

This linear regression minimizes the SSE, which is smaller than the SST from the previous graph. The leftover amount from the SST is due to regression and is the SSR. If we were to add up all the squares of the distance

Coefficient of Determination

The coefficient of determination gives us an idea of how well this linear regression fits the data. It is calculated as \(r^2 = \frac{SSR}{SST}\). For this example the coefficient of determination is 0.7528328, which means that about 75% of the SST can be accounted for by the linear regression.
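The calculation can be sketched as follows, again assuming the mpg-on-weight model on mtcars. The hand-computed value is compared with the one stored by summary(), and with the squared correlation between the two variables, which coincides with \(r^2\) in simple linear regression.

```r
# Coefficient of determination for the regression of mpg on wt (mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

sst <- sum((y - mean(y))^2)
sse <- sum((y - fitted(fit))^2)
ssr <- sst - sse

r2 <- ssr / sst                               # by hand
r2_lm <- summary(fit)$r.squared               # as reported by summary.lm()
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2        # squared correlation

print(c(by_hand = r2, from_summary = r2_lm, squared_cor = r2_cor))
```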

This number can be misleading, though. It is really only informative if the relationship in the data we are performing the linear regression on is mostly linear.
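One way to see this, offered as an illustration rather than as part of the mtcars example, is R's built-in anscombe dataset. Its first pair of columns (x1, y1) is roughly linear, while the second pair (x2, y2) follows a clear curve, yet a straight-line fit gives nearly the same coefficient of determination for both.

```r
# Two data sets from the built-in anscombe data frame:
# (x1, y1) is roughly linear, (x2, y2) is clearly curved.
fit_linear <- lm(y1 ~ x1, data = anscombe)
fit_curved <- lm(y2 ~ x2, data = anscombe)

r2_linear <- summary(fit_linear)$r.squared
r2_curved <- summary(fit_curved)$r.squared

# The two values are almost identical even though only the
# first data set is well described by a straight line.
print(c(linear = r2_linear, curved = r2_curved))
```

Plotting each pair, for example with plot(anscombe$x2, anscombe$y2), makes the difference visible in a way the single number does not.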