QM Workbook 2

Regression for Describing and Forecasting

We ended the last lesson by calculating the slope of the regression line, which summarizes the relationship between two vectors of data. Regression analysis is the workhorse of much of modern social science and policy analysis, so we need to go much deeper into this topic.

As we saw, the slope of the regression line is just a normal rise-over-run slope value, and it allows us to make predictions about a value of Y based on X. In a scatterplot of two variables against each other, if the line has an intercept of 0, we just move along the line from the 0 intercept, going up the y-axis according to the slope value - for each one unit you move right along the x-axis, you move vertically according to the slope coefficient.

Now, our guess might not be the same as the actual value, but that is our best guess given the means and variances of the two variables, and their covariance. In other words, it is our best guess given the structure of the data we have from this sample.

So in this example, the calculated slope of the regression line is about 2. To figure out the value of Y when X is 5, given that the y-intercept here is 0, we move 2 units up each time we move 1 unit right, so our best guess of Y when X equals 5 is about 10.

set.seed(123)
x<-runif(100,1,10)
y <- (0+x*2) + rnorm(100,0,1)

slope<-cov(x,y)/var(x)
slope
[1] 1.990019
plot(x,y,xlim=c(0,15))
abline(coef=c(0,slope))
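
We can check this prediction directly in R using the estimated slope (the 0 is the intercept we built into the simulated data):

0 + slope*5 # best guess of Y when X equals 5; roughly 10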

Here, I’ve changed the data so that the y-intercept is 10 and the slope of the regression line is negative 3. So, our best guess about the value of Y at X=5 is 10 + (5*-3), which is -5.

set.seed(123)
x<-runif(100,1,10)
y <- (10+x*-3) + rnorm(100,0,1)

slope<-cov(x,y)/var(x)
slope
[1] -3.009981
plot(x,y,xlim=c(0,15),ylim=c(-20,10))
abline(coef=c(10,slope))
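
And again, checking with the estimated slope (the 10 is the intercept we built into the simulation):

10 + slope*5 # best guess of Y when X equals 5; roughly -5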

We can turn this intuition into what is called a regression equation, which would look like this:

\[ E[y_i] = \alpha + \beta*x_i \]

In English, we would say that our expected (i.e., predicted) value of Y is equal to the intercept (Greek alpha, \(\alpha\)) plus the slope coefficient for (X,Y) (Greek \(\beta\)) multiplied by an element of the X vector. This is shorthand for exactly what we just did with rise-over-run.

Many terms are used when describing a regression equation. \(\alpha\) and \(\beta\) are parameters. The outcome is the predicted value, or the dependent variable. X is the independent variable. The regression equation expresses the linear relationship between X and Y, because we can literally draw this line on our scatterplot.

You might notice that our description of our data using the regression equation is incomplete. The regression equation summarizes the line of best fit, but it doesn’t account for the actual variation we see in the cloud of points. Most of our actual points are not on the line of best fit. To account for this, we add to the regression equation the error term, Greek letter epsilon, \(\epsilon\).

Error in a regression equation is equal to the actual value minus the predicted value for each pair of points in the two vectors, so:

\[ error_i = actual_i - predicted_i \]

We add this into the regression equation, so:

\[ y_i = \alpha + \beta*x_i + \epsilon_i \]

(Notice that the left-hand side is now the actual value \(y_i\) rather than the expected value, because the error term accounts for the gap between the two.)
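
One way to sketch this bookkeeping in R, using the simulated x and y from the chunk above and treating the known intercept of 10 plus the estimated slope as our parameters:

predicted <- 10 + slope*x   # alpha + beta*x for each element of x
errors <- y - predicted     # actual minus predicted, one error per observation
head(errors)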

It’s easy to visualize error. Here is a scatterplot of two variables.

# https://rpubs.com/iabrady/residual-analysis

library(ggplot2) # needed below for ggplot()

d <- mtcars

ggplot(d, aes(x = wt, y = mpg)) +
  geom_point()

Add a regression line.

ggplot(d, aes(x = wt, y = mpg)) +
  geom_point()+
  geom_smooth(method="lm",se = FALSE)

The error for each point is the vertical distance from the point to the line that predicts the values. The length of each of these segments is called a residual, and collectively the vector of these distances is the residuals.

fit <- lm(mpg ~ wt, data = d) # fit the model
d$predicted <- predict(fit)   # Save the predicted values
d$residuals <- residuals(fit) # Save the residual values

ggplot(d, aes(x = wt, y = mpg)) +
  geom_point()+
  geom_smooth(method="lm",se = FALSE)+
  geom_segment(aes(xend = wt, yend = predicted), alpha = .2)

In the last lesson, we waved our hands at how we knew that the slope we calculated using cov(x,y)/var(x) was the best possible slope that characterized the relationship between X and Y. Now, we can see more clearly why this is.

The reason why we pick this line is that, of all possible lines, this is the one that yields the minimum sum of squared errors (SSE). To say it again slightly differently: this line minimizes the SSE. The SSE is what you would get if you calculated the difference between each actual value and its predicted value (remembering that there could be negative values here), squared each of those values (so you get all positive values), and then added all those values up. To use some technical language possibly incorrectly, the SSE is kind of like a measure of the deviation of the actual points from the predicted points. We want to minimize this deviation, and the SSE provides us a mathematical avenue for doing this.
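
To make that concrete, here is the SSE calculation spelled out step by step in R, using the mtcars fit from the residual plot above (where d$predicted holds the predicted values):

errors <- d$mpg - d$predicted  # actual minus predicted for each car
squared_errors <- errors^2     # squaring makes every value positive
sse <- sum(squared_errors)     # add them all up
sse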

We are not going to get into the mathematical reasons why the slope coefficient calculated this way always gives the line that minimizes the SSE. Very smart statisticians and mathematicians have figured this out for us, and they have implemented computer algorithms that use matrix algebra to calculate the parameters of the regression line with lightning speed. With these tools, it is incredibly easy to take a vector of outcomes and a vector of predictors, estimate the intercept and slope coefficient for a regression equation, and also calculate the sum of squared errors.

In R, the function that implements the matrix algebra we need to do this is lm(). You might have noticed that I used this function above when I made the residual plot.

I’m going to do the same thing and walk through it. We use lm() to fit the model, which means to ask R to calculate the parameters. R uses matrix algebra and does this in seconds.

d <- mtcars ### Import data

mod<-lm(mpg ~ wt, data = d) # fit the model - this means "calculate the parameters"

### show me the coefficients
print(mod)

Call:
lm(formula = mpg ~ wt, data = d)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  
## resulting model object is a list that contains within it additional information

This object “mod” that we saved the lm output to is a list object in R that contains within it a lot of information. You can see this information with names().

names(mod)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

Looking at this list, you see the mod object includes the coefficients.

mod$coefficients
(Intercept)          wt 
  37.285126   -5.344472 
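
Reading these coefficients off, the fitted regression equation for this data is:

\[ E[mpg_i] = 37.29 + (-5.34)*wt_i \]

Since wt is measured in thousands of pounds, each additional 1,000 lbs of weight is associated with a predicted drop of about 5.3 miles per gallon.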

And the residuals, which are labeled here.

mod$residuals
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
         -2.2826106          -0.9197704          -2.0859521           1.2973499 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
         -0.2001440          -0.6932545          -3.9053627           4.1637381 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
          2.3499593           0.2998560          -1.1001440           0.8668731 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
         -0.0502472          -1.8830236           1.1733496           2.1032876 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
          5.9810744           6.8727113           1.7461954           6.4219792 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
         -2.6110037          -2.9725862          -3.7268663          -3.4623553 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
          2.4643670           0.3564263           0.1520430           1.2010593 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         -4.5431513          -2.7809399          -3.2053627          -1.0274952 

And the predicted values, which are called “fitted” values here.

mod$fitted.values
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
          23.282611           21.919770           24.885952           20.102650 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
          18.900144           18.793255           18.205363           20.236262 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
          20.450041           18.900144           18.900144           15.533127 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
          17.350247           17.083024            9.226650            8.296712 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
           8.718926           25.527289           28.653805           27.478021 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
          24.111004           18.472586           18.926866           16.762355 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
          16.735633           26.943574           25.847957           29.198941 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
          20.343151           22.480940           18.205363           22.427495 

From this, we could calculate the sum of squared residuals.
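
For example, a one-liner using the residuals stored in the model object:

sum(mod$residuals^2) # sum of squared residuals for the fitted line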

So, now, let’s put this lm() function to work and test your understanding.

Exercise 1

Make a sequence of X values and Y values that are correlated. Give them some kind of hypothetical real-life meaning as two features that are plausibly correlated (e.g., sugary drinks consumed and number of cavities).

Make at least 20 pairs of X and Y corresponding to 20 units of observation (e.g. 20 people who drink different amounts of sugary drinks and have different numbers of cavities).

You can do this however you like in R. There are simple ways and slightly more complicated but faster ways. Choose whatever is easy.

Use lm() to estimate the regression coefficients.

Write the regression equation: ______

Find the predicted values using your regression equation, without $fitted.values. Use only the coefficients from the regression equation and the two vectors X and Y. Put X, Y, and the predicted values into three columns of a data frame.

Without using $residuals, calculate the vector of residuals. Add a column of the residuals to your data frame.

Find the sum of squared residuals.

Now, pick a slope and intercept for a “bad” regression line that doesn’t seem to capture the shape of the underlying data very well.

Intercept: ____

Slope: ____

Make a scatterplot of X and Y. Use abline() to add the “good” regression line using the intercept and slope from lm(). Then, use abline() to add the “bad” regression line. (Note: you are welcome to do this with ggplot() if you prefer).

Calculate the predicted values for your “bad” regression line and add them to your data frame in a column called “bad_predict.”

Calculate a column of “bad” residuals:

Calculate the sum of squared residuals from your “bad” regression line.

How does the sum of squared residuals from the “good” regression line compare to the sum of squared residuals for the “bad” regression line? (I really hope the “bad” one is higher.)

Rhetorical question (no response needed): Do you see how the SSE helps us to pick the “best” line?

1pt Extra Credit:

Make a matrix of plots using ggplot() that shows the line segments from predicted to actual values for your two different regression lines, similar to what I did above.

1pt Extra Credit: If you have good R skills, you should be able to pretty easily spin up 10 different candidate lines for a scatterplot, plot the lines onto the scatterplot, and quickly estimate 10 SSEs. See what you can do!

One might ask - is our regression line any “good”? Does it capture the relationship between X and Y very well? We’ll get to that later.

Additional Exercises

Do Exercises 5.1-5.4 of TCWD.