2024-06-09

Linear Regression

  • Given input term(s), we want to be able to predict an output/response using a linear model
  • The simplest case: 1 input

  • Example: Longley’s Economic Regression Data (R’s built-in longley dataset), showing the relationship between the number of people employed and gross national product.

R function lm()

lm_econ <- lm(GNP ~ Employed, data=longley)
  • The lm() (linear model) function takes a formula of the form response ~ terms, plus the dataset in question.
  • By calling the summary() function on lm_econ, we can see useful facts about the regression model. (See next slides.)
  • With a single input, as in the current data, the fitted model has the form \(y = mx + b\).
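As a minimal sketch of the steps above, the fitted coefficients can be pulled out of lm_econ and plugged into \(y = mx + b\) by hand. The input value x_new is a hypothetical chosen only for illustration.

```r
# Fit the one-input model on R's built-in longley dataset
lm_econ <- lm(GNP ~ Employed, data = longley)

# Print the regression summary (coefficients, R-squared, p-values)
print(summary(lm_econ))

# coef() returns a named vector: "(Intercept)" is b, "Employed" is m
b <- coef(lm_econ)[["(Intercept)"]]
m <- coef(lm_econ)[["Employed"]]

# Hypothetical input value, in the same units as longley$Employed
x_new <- 65

# Prediction by hand, and the same prediction via predict()
y_manual  <- m * x_new + b
y_predict <- predict(lm_econ, newdata = data.frame(Employed = x_new))
```

Both y_manual and y_predict should agree, since predict() applies the same linear formula.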

lm() continued

The first column of numbers under “Coefficients” (labeled Estimate) shows the y-intercept and the slope; the slope is, in this case, the weight assigned to the variable “Employed”. Both R-squared values (multiple and adjusted) are very close to 1, indicating a good fit.
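The same numbers can also be read programmatically instead of from the printed summary. A short sketch, assuming the lm_econ model from the earlier slide:

```r
# Refit the model so this snippet runs on its own
lm_econ <- lm(GNP ~ Employed, data = longley)
s <- summary(lm_econ)

# The "Estimate" column holds the y-intercept and the slope
estimates <- s$coefficients[, "Estimate"]
print(estimates)

# Multiple and adjusted R-squared
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
print(c(r2 = r2, adj_r2 = adj_r2))
```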

lm() continued

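We can also see the p-values in the column Pr(>|t|). In this instance, the p-value for Employed is much smaller than the common threshold of 0.05, so the relationship between Employed and GNP is statistically significant. Together with the high R-squared, this suggests that our model is a good predictor.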
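One possible way to extract the p-values and compare them against the 0.05 threshold, again assuming the lm_econ model from the earlier slide:

```r
# Refit the model so this snippet runs on its own
lm_econ <- lm(GNP ~ Employed, data = longley)

# The coefficient table is a matrix; the p-values are in "Pr(>|t|)"
p_values <- summary(lm_econ)$coefficients[, "Pr(>|t|)"]
print(p_values)

# Compare the p-value for Employed with the common 0.05 threshold
p_employed     <- p_values[["Employed"]]
is_significant <- p_employed < 0.05
print(is_significant)
```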

Graphing linear regression (plotly)

  • We add y = fitted(lm_econ) as an argument to plotly’s add_lines() to graph our line of regression.
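A minimal sketch of such a plot, assuming the plotly package is installed. The variable name fig and the trace names are illustrative choices, not part of the original slides.

```r
library(plotly)

# Refit the model so this snippet runs on its own
lm_econ <- lm(GNP ~ Employed, data = longley)

# Scatter plot of the data, with the fitted values drawn as a line
fig <- plot_ly(longley, x = ~Employed, y = ~GNP) %>%
  add_markers(name = "Data") %>%
  add_lines(y = fitted(lm_econ), name = "Regression line")

fig
```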

Graphing linear regression (ggplot2)

  • We can employ geom_smooth(method = "lm") without using the lm_econ we previously calculated. By default, ggplot2 shows the confidence interval.
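A minimal sketch of the ggplot2 version, assuming the ggplot2 package is installed. The variable name p is an illustrative choice.

```r
library(ggplot2)

# geom_smooth(method = "lm") fits the regression itself,
# so lm_econ is not needed here
p <- ggplot(longley, aes(x = Employed, y = GNP)) +
  geom_point() +
  geom_smooth(method = "lm")  # se = TRUE by default: draws the confidence band

print(p)
```

Passing se = FALSE to geom_smooth() hides the confidence band if only the line is wanted.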

Graphing residuals

  • Graphing residuals is another way of checking the validity of our model. The residuals should look fairly “random”, with no visible pattern; if they do not, our model might be a bad fit.
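One simple way to draw such a plot uses base R graphics. This is a sketch, not necessarily how the original slide produced its figure.

```r
# Refit the model so this snippet runs on its own
lm_econ <- lm(GNP ~ Employed, data = longley)

# Residuals: observed GNP minus fitted GNP
res <- resid(lm_econ)

# Plot residuals against fitted values; a dashed line marks zero
plot(fitted(lm_econ), res,
     xlab = "Fitted GNP", ylab = "Residual",
     main = "Residuals of GNP ~ Employed")
abline(h = 0, lty = 2)
```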

Multiple input terms

  • For \(n\) input terms, the simple \(y = mx+b\) formula generalizes to \[y = w_{0} + w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n} \]
  • \(w_{0}\) replaces the y-intercept from our simple 2-d formula, and all the other \(w\) terms are weights for the input variables, giving us a vector of weights, \(\mathbf{w}\)
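To make the weight vector concrete, the sketch below fits a model with two inputs from longley (Employed and Armed.Forces, chosen here only for illustration) and checks that the fitted values are the weighted sum from the formula above.

```r
# lm() accepts several inputs joined by + in the formula
lm_econ_multi <- lm(GNP ~ Employed + Armed.Forces, data = longley)

# Weight vector w: w0 (intercept), w1 (Employed), w2 (Armed.Forces)
w <- coef(lm_econ_multi)
print(w)

# Design matrix: a column of 1s for w0, then one column per input
X <- model.matrix(lm_econ_multi)

# y = w0 + w1*x1 + w2*x2 for every row, written as a matrix product
y_hat <- as.vector(X %*% w)
```

The values in y_hat should match fitted(lm_econ_multi).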

Multiple input terms, continued

  • For the function lm(), we can add multiple inputs using +:
lm_econ_multi <- lm(GNP ~ Employed + Armed.Forces, data=longley)
  • In the summary of this model, the p-value for Armed.Forces is high, so once Employed is in the model, the number of people in the armed forces adds little to the prediction of GNP.
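The per-input p-values can be read from the coefficient table of the multi-input model. A short sketch, assuming the lm_econ_multi model defined above:

```r
# Refit the model so this snippet runs on its own
lm_econ_multi <- lm(GNP ~ Employed + Armed.Forces, data = longley)

# One row per term: (Intercept), Employed, Armed.Forces
coef_table <- summary(lm_econ_multi)$coefficients
print(coef_table)

# p-value for a single input, indexed by row and column name
p_armed <- coef_table["Armed.Forces", "Pr(>|t|)"]
print(p_armed)
```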