Using R for Linear Regression

In this presentation we will explore simple linear regression methods in R.

  • Beyond its computational strengths, R produces plots that visualize data and overlay fitted models on it
  • 2 useful plotting tools for linear regression are
    • Plotly (the plotly package)
    • GGPlot (the ggplot2 package)

What is Linear Regression?

Simple linear regression is a modelling technique that relates 2 variables, an independent variable \(x\) and a dependent variable \(y\), through the slope-intercept equation of a line: \[y = ax + b\]

written in statistics as: \[y = \beta_{0} + \beta_{1} \cdot x\] where \(\beta_{1}\) is the modeled slope and \(\beta_{0}\) is the modeled y-intercept, both estimated from the data by a fitting function, in this case in R.

The standard fitting technique for simple linear regression is the Ordinary Least Squares method, which chooses the line that minimizes the sum of the squared vertical offsets between each data point and the line.
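As a minimal sketch of such a fit, base R's lm() estimates the intercept and slope by ordinary least squares. The choice of the built-in mtcars dataset and the variable names fit, beta0, and beta1 are only for illustration.

```r
# Fit fuel efficiency (mpg) as a linear function of weight (wt) by OLS.
fit <- lm(mpg ~ wt, data = mtcars)

# coef() returns the estimated intercept and slope as a named vector.
beta <- coef(fit)
beta0 <- unname(beta["(Intercept)"])  # estimated y-intercept
beta1 <- unname(beta["wt"])           # estimated slope

print(beta)
```

The slope comes out negative here, meaning that the fitted mpg drops as weight increases.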

What is Linear Regression?

In the Ordinary Least Squares method, the error term (vertical offset), \(\epsilon\), is defined for each data point: \[y_{i} = \beta_{0} + \beta_{1} \cdot x_{i} + \epsilon_{i}\]

The model finds the line that minimizes the sum of the squared vertical offsets, \(Q\), where \(b_{0}\) and \(b_{1}\) are the estimates of \(\beta_{0}\) and \(\beta_{1}\): \[Q = \sum_{i = 1}^{n} \hat{\epsilon}_{i}^{2} = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} = \sum_{i = 1}^{n} (y_{i} - (b_{0} + b_{1} \cdot x_{i}))^{2}\]

Setting the partial derivatives of \(Q\) with respect to \(b_{0}\) and \(b_{1}\) to zero gives closed-form values for the slope and intercept of the line: \[b_{1} = \frac{\sum_{i = 1}^{n} (x_{i} - \overline{x})(y_{i} - \overline{y})}{\sum_{i = 1}^{n} (x_{i} - \overline{x})^{2}}\] and \[b_{0} = \overline{y} - b_{1} \cdot \overline{x}\]
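The expressions for \(b_{1}\) and \(b_{0}\) above can be checked numerically. The sketch below computes them by hand on the built-in Orange dataset, compares them with lm(), and evaluates \(Q\) at the fitted line and at two nudged lines. The helper Q and the nudge size of 1 are illustrative assumptions.

```r
x <- Orange$age
y <- Orange$circumference

# Slope: sum of cross-deviations divided by the sum of squared x-deviations.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)

# Reference estimates from lm().
fit <- lm(circumference ~ age, data = Orange)

# Sum of squared vertical offsets for a candidate line y = c0 + c1 * x.
Q <- function(c0, c1) sum((y - (c0 + c1 * x))^2)

q_fit     <- Q(b0, b1)       # Q at the least-squares estimates
q_shifted <- Q(b0 + 1, b1)   # intercept nudged upward
q_tilted  <- Q(b0, b1 + 1)   # slope nudged upward

print(c(b0 = b0, b1 = b1))
print(coef(fit))
print(c(fit = q_fit, shifted = q_shifted, tilted = q_tilted))
```

The hand-computed coefficients should agree with coef(fit) up to floating-point error, and both nudged lines should give a larger \(Q\) than the fitted one.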

R Code Examples

Here is the R code for 2 of the plots in this presentation. Code for Plotly tends to be more involved.

GGPlot example

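library(ggplot2)
# Scatter plot of the Orange data with an OLS line and a 99% confidence band
g <- ggplot(Orange, aes(x = age, y = circumference)) + geom_point()
g + geom_smooth(method = "lm", level = 0.99) + coord_cartesian(ylim = c(0, 250))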

Plotly Example

library(plotly)

# Fit the model, then plot the raw data with the fitted line on top
mod <- lm(mpg ~ wt, data = mtcars)
x <- mtcars$wt
y <- mtcars$mpg

xax <- list(title = "Weight", titlefont = list(family = "Modern Computer Roman"))
yax <- list(title = "Miles per Gallon", titlefont = list(family = "Modern Computer Roman"),
            range = c(0, 35))
fig <- plot_ly(x = x, y = y, type = "scatter", mode = "markers", name = "data") %>%
  add_lines(x = x, y = fitted(mod), name = "fitted") %>%
  layout(xaxis = xax, yaxis = yax, margin = list(l = 150, r = 50, b = 20, t = 40))
# Hide the Plotly logo in the mode bar
config(fig, displaylogo = FALSE)

Linear Regression Plot with Plotly

Using the mtcars dataset in R, we can plot the relationship between weight and fuel efficiency. As you can see, there is a negative correlation between the two: in general, the heavier the car, the less fuel efficient it is.
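The strength of this relationship can also be put into numbers. As a small sketch using base R only, the Pearson correlation between weight and mpg is negative, and in simple linear regression its square equals the R-squared reported by lm(). The variable names are illustrative.

```r
# Pearson correlation between weight and fuel efficiency.
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared of the corresponding simple linear regression.
fit <- lm(mpg ~ wt, data = mtcars)
r_squared <- summary(fit)$r.squared

print(c(r = r, r_squared = r_squared))
```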

Linear Regression Plots with GGPlot

Using the Orange dataset in R, we can examine the relationship between the age and circumference of orange trees. As you can see, there is a positive correlation between the two variables: the older the tree, the larger its circumference. Note that GGPlot draws a confidence band around the fitted line, in this instance set to 99%.
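To see what such a band represents, the same kind of interval can be computed without plotting. The sketch below assumes the band corresponds to the pointwise confidence interval for the mean response, as returned by predict() with interval = "confidence". The 50-point grid and the 95% comparison are illustrative choices.

```r
fit <- lm(circumference ~ age, data = Orange)

# Evaluate the fitted line on an evenly spaced grid of ages.
grid <- data.frame(age = seq(min(Orange$age), max(Orange$age), length.out = 50))

# Each result is a matrix with columns fit, lwr, and upr.
band99 <- predict(fit, newdata = grid, interval = "confidence", level = 0.99)
band95 <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Width of each band at every grid point.
width99 <- band99[, "upr"] - band99[, "lwr"]
width95 <- band95[, "upr"] - band95[, "lwr"]

print(head(cbind(grid, band99)))
```

A higher confidence level gives a wider band, so width99 exceeds width95 at every grid point.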

Linear Regression Plots with GGPlot

Here is another example, a positive correlation between displacement and horsepower. Note the broader confidence band (also set to 99%). This GGPlot also conveys a 3rd variable, the number of cylinders: displacement and horsepower both tend to increase as the number of cylinders increases.
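The trend across cylinder counts can be checked with a short summary. This sketch assumes the plot is based on the built-in mtcars dataset (columns disp, hp, and cyl), which the slide does not state explicitly.

```r
# Correlation between displacement and horsepower.
r <- cor(mtcars$disp, mtcars$hp)

# Mean displacement and horsepower within each cylinder count.
by_cyl <- aggregate(cbind(disp, hp) ~ cyl, data = mtcars, FUN = mean)

print(r)
print(by_cyl)
```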

Summary

As you can see, linear regression is a useful tool that can be easily implemented in R. Some important points to remember:

  • Linear regression is a useful model to estimate the relationship between two quantitative variables

  • It is important not to rely on linear regression alone to infer causation - more analysis is needed to determine causality

  • Linear regression can be a useful tool in forecasting when the variables are strongly correlated