Basic Linear Regression

  • Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

  • Linear regression assumes a linear relationship between the variables and uses a best fit line to model the data. The line can also be used to make predictions outside the observed data range, although such extrapolations are less reliable.

  • Linear regression is commonly used to predict trends and future outcomes, such as expected sales or expected turnout at an event.

The Best Fit Line

  • The best fit line runs through the data such that the overall vertical distance between the data points and the line is as small as possible.

  • It is written as \(y = b_0 + b_1 x\) where

    • \(b_0\) is the intercept
    • \(b_1\) is the slope
  • A common method to find this line is the least squares method, which chooses \(b_0\) and \(b_1\) to minimize the sum of the squared residuals \(\sum (y_i - \hat{y}_i)^2\) where

    • \(y_i\) is the observed value
    • \(\hat{y}_i\) is the predicted value
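
As a minimal sketch of the least squares idea, the slope and intercept can be computed from the standard closed-form expressions and compared with R's `lm()`. The choice of `disp` and `hp` from the built-in `mtcars` data is only an assumption for illustration.

```r
# Closed-form least squares estimates for hp ~ disp (mtcars ships with base R)
x <- mtcars$disp
y <- mtcars$hp

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

y_hat <- b0 + b1 * x         # predicted values
rss   <- sum((y - y_hat)^2)  # sum of squared residuals that is being minimized

# lm() should recover the same intercept and slope
fit <- lm(hp ~ disp, data = mtcars)
all.equal(unname(coef(fit)), c(b0, b1))
```

Any other line through the same data would give a larger value of `rss`.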

Correlation

  • The correlation coefficient \(r\) is a measure of the strength and direction of the linear relationship
  • Values always range from -1 to 1
  • Values closer to 0 indicate a weaker linear relationship
  • Values closer to -1 or 1 indicate a stronger linear relationship; the sign gives the direction

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]
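
The formula above can be checked directly in R. This is a small sketch that assumes the `disp` and `mpg` columns of `mtcars` as example data and compares the hand-computed value with the built-in `cor()`.

```r
# Correlation coefficient computed term by term from the formula
x <- mtcars$disp
y <- mtcars$mpg

r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```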

Predictions and the Confidence Interval

  • Confidence intervals are used to account for the error that is present in the best fit line
  • The confidence interval represents the range of values likely to contain the average response
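  • The confidence interval can be found using the following formula:

\[ \hat{y} \pm t^* \cdot s \sqrt{ \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} } \]

where

    • \(\hat{y}\) is the predicted value at \(x\)
    • \(t^*\) is the critical value of the t distribution with \(n - 2\) degrees of freedom
    • \(s\) is the residual standard error
    • \(n\) is the sample size
    • \(x\) is the input value
    • \(x_i\) are the observed values of the independent variable
    • \(\bar{x}\) is the mean of the \(x_i\)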
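
The interval can be worked through numerically. The sketch below assumes a model of `hp` on `disp` from `mtcars` and a hypothetical input value `x0 = 200`, then compares the hand-computed interval with `predict()`.

```r
fit <- lm(hp ~ disp, data = mtcars)

x  <- mtcars$disp
n  <- length(x)
x0 <- 200  # hypothetical input value chosen for illustration

y_hat  <- unname(coef(fit)[1] + coef(fit)[2] * x0)  # predicted value at x0
s      <- sigma(fit)                                # residual standard error
t_star <- qt(0.975, df = n - 2)                     # critical value for a 95% interval

half_width <- t_star * s *
  sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
ci_manual <- c(lower = y_hat - half_width, upper = y_hat + half_width)

# Same interval from predict(); columns are fit, lwr, upr
ci_builtin <- predict(fit, newdata = data.frame(disp = x0),
                      interval = "confidence", level = 0.95)
```

The interval is narrowest when the input value equals \(\bar{x}\) and widens as the input moves away from it, because of the \((x - \bar{x})^2\) term.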

Positive Correlation Example Using mtcars

  • This is a simple example of a positive linear relationship.
  • As displacement increases, so does horsepower
  • The grey band shows the 95% confidence interval around the fitted line
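
The code behind this plot is not shown here. One way such a plot could be produced, assuming `ggplot2` is used, is with `geom_smooth(method = "lm")`, which draws the fitted line with a grey confidence band:

```r
library(ggplot2)

# Scatter plot of horsepower against displacement with a linear fit;
# the shaded band is the 95% confidence interval for the fitted line
p <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95) +
  labs(x = "Displacement (cu. in.)", y = "Horsepower")

p
```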

Negative Correlation Example Using mtcars

  • This is a simple example of a negative linear relationship.
  • As displacement increases, miles per gallon decreases
  • The grey band shows the 95% confidence interval around the fitted line

Non-linear Correlation

  • Linear regression requires an approximately linear relationship in the data to produce an accurate model
  • Very often data is not linear and needs different methods for prediction
  • For example, the following plot uses the economics data set (from ggplot2), in which the unemployment rate is strongly nonlinear over time
  • The plot on the next slide includes the linear regression line
  • It should be clear from the plot that the linear model would not produce accurate predictions

Plotly Graph of Nonlinear Data

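library(plotly)   # plot_ly(), add_lines(), layout()
library(ggplot2)  # provides the economics data set

# Fit a straight line to the unemployment rate over time
model <- lm(I(unemploy / pop) ~ as.numeric(date), data = economics)

# Scatter plot of the observed rate with the linear fit overlaid
plot_ly(economics, x = ~date, y = ~unemploy / pop, type = "scatter",
        mode = "markers", name = "% unemployment") %>%
  add_lines(y = fitted(model), name = "Linear Fit") %>%
  layout(title = "Unemployment by Year",
         yaxis = list(title = "Unemployment Rate", tickformat = ".1%"))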
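
Beyond inspecting the plot, the poor fit can be checked numerically. The following sketch refits the same model and looks at its R-squared and residuals; the variable names `r_squared` and `resid_by_date` are introduced here for illustration only.

```r
library(ggplot2)  # provides the economics data set

model <- lm(I(unemploy / pop) ~ as.numeric(date), data = economics)

# Share of the variance in the unemployment rate explained by the straight line
r_squared <- summary(model)$r.squared

# Residuals over time; long runs above or below zero suggest
# that a straight line is missing structure in the data
resid_by_date <- data.frame(date = economics$date,
                            residual = residuals(model))
plot(resid_by_date$date, resid_by_date$residual, type = "l",
     xlab = "Date", ylab = "Residual")
abline(h = 0, lty = 2)
```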