set.seed(123)
x <- runif(100,1,10)
y <- (0+x*2) + rnorm(100,0,1)
slope <- cov(x,y)/var(x)
slope
[1] 1.990019
plot(x,y,xlim=c(0,15))
abline(coef=c(0,slope))
We ended up in the last lesson calculating the slope of the regression line, which summarizes the relationship between two vectors of data. Regression analysis is the workhorse of much of modern social science and policy analysis, so we need to go much deeper into this topic.
As we saw, the slope of the regression line is just a normal rise-over-run slope value, and it allows us to make predictions about a value of Y based on X. In a scatterplot of two variables against each other, if the line has an intercept of 0, we just move right along the x-axis from the 0 intercept, following the line up the y-axis according to the slope value - for each one unit you move right, you move vertically according to the slope coefficient.
Now, our guess might not be the same as the actual value, but that is our best guess given the means and variances of the two variables, and their covariance. In other words, it is our best guess given the structure of the data we have from this sample.
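In symbols, the slope calculation from the code above is:
\[ slope = \frac{cov(x, y)}{var(x)} \]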
So in this example, the calculated slope of the regression line is about 2. To figure out the value of Y when X is 5, given that the y-intercept here is 0, we move 2 units up each time we move 1 unit right, so our best guess of Y when X equals 5 is about 10.
set.seed(123)
x <- runif(100,1,10)
y <- (0+x*2) + rnorm(100,0,1)
slope <- cov(x,y)/var(x)
slope
[1] 1.990019
plot(x,y,xlim=c(0,15))
abline(coef=c(0,slope))
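To make that concrete with the objects from the chunk above: the intercept is 0 and the estimated slope is stored in slope, so the rise-over-run prediction at X = 5 is just one line of arithmetic.

0 + slope*5  # intercept + slope * X; about 10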
Here, I’ve changed the data so that the y-intercept is 10 and the slope of the regression line is negative 3. So, our best guess about the value of Y at X = 5 is 10 + (5*-3), which is -5.
set.seed(123)
x <- runif(100,1,10)
y <- (10+x*-3) + rnorm(100,0,1)
slope <- cov(x,y)/var(x)
slope
[1] -3.009981
plot(x,y,xlim=c(0,15),ylim=c(-20,10))
abline(coef=c(10,slope))
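The same quick check works here with the new intercept:

10 + slope*5  # 10 + 5*-3; about -5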
We can turn this intuition into what is called a regression equation, which would look like this:
\[ E[y_i] = \alpha + \beta*x_i \]
In English, we would say that our expected (i.e., predicted) value of Y is equal to the intercept (Greek alpha, \(\alpha\)) plus the slope coefficient for (X,Y) (Greek beta, \(\beta\)) multiplied by an element of the X vector. This is shorthand for exactly what we just did with rise-over-run.
Many terms are used when describing a regression equation. \(\alpha\) and \(\beta\) are parameters. The outcome is the predicted value, or the dependent variable. X is the independent variable. The regression equation expresses the linear relationship between X and Y, because we can literally draw this line on our scatterplot.
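To connect this back to the two simulated examples above, the estimated equations are approximately:
\[ E[y_i] = 0 + 2x_i \quad \text{and} \quad E[y_i] = 10 - 3x_i \]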
You might notice that our description of our data using the regression equation is incomplete. The regression equation summarizes the line of best fit, but it doesn’t account for the actual variation we see in the cloud of points. Most of our actual points are not on the line of best fit. To account for this, we add to the regression equation the error term, Greek letter epsilon, \(\epsilon\).
Error in a regression equation is equal to the actual value minus the predicted value for each pair of points in the two vectors, so:
\[ error_i = actual_i - predicted_i \]
We add this into the regression equation, which now describes the actual value of Y rather than just the expected value:
\[ y_i = \alpha + \beta*x_i + \epsilon_i \]
It’s easy to visualize error. Here is a scatterplot of two variables.
# https://rpubs.com/iabrady/residual-analysis
library(ggplot2)  # ggplot2 needs to be loaded for the plots below, if it isn't already
d <- mtcars
ggplot(d, aes(x = wt, y = mpg)) +
geom_point()
Add a regression line.
ggplot(d, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
The error for each point is the vertical distance from the point to the line that predicts values. The length of each of these segments is called a residual, and collectively the vector of these distances is the residuals.
fit <- lm(mpg ~ wt, data = d)  # fit the model
d$predicted <- predict(fit)    # Save the predicted values
d$residuals <- residuals(fit)  # Save the residual values

ggplot(d, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_segment(aes(xend = wt, yend = predicted), alpha = .2)
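If you want to convince yourself that these stored residuals are just the actual values minus the predicted values, a quick check using the fit object from the chunk above is:

all.equal(unname(residuals(fit)), unname(d$mpg - predict(fit)))  # should be TRUE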
In the last lesson, we waved our hands at how we knew that the slope we calculated using cov(x,y)/var(x) was the best possible slope that characterized the relationship between X and Y. Now, we can see more clearly why this is.
The reason why we pick this line is that, of all possible lines, this is the one that yields the minimum sum of squared errors (SSE). To say it again slightly differently: this line minimizes the SSE. The SSE is what you would get if you calculated the difference between each predicted value and the actual value (remembering that there could be negative values here), squared each of those differences (so you get all positive values), and then added them all up. To use some technical language possibly incorrectly, the SSE is kind of like a measure of the deviation of the actual points from the predicted points. We want to minimize this deviation, and the SSE gives us a mathematical avenue for doing so.
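Written in the same notation as the error equation above, the quantity being minimized is:
\[ SSE = \sum_{i=1}^{n} (actual_i - predicted_i)^2 = \sum_{i=1}^{n} error_i^2 \]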
We are not going to get into the mathematical reasons for why the slope coefficient we calculated always produces the line that minimizes the SSE. Very smart statisticians and mathematicians have figured this out for us, and they have implemented computer algorithms that use matrix algebra to calculate the parameters of the regression line with lightning speed. With these tools, it is incredibly easy to take a vector of outcomes and a vector of predictors, estimate the intercept and slope coefficient for a regression equation, and also calculate the sum of squared errors.
In R, the function that implements the matrix algebra we need to do this is lm(). You might have noticed that I used this function above when I made the residual plot.
I’m going to do the same thing and walk through it. We use lm() to fit the model, which means to ask R to calculate the parameters. R uses matrix algebra and does this in seconds.
d <- mtcars  ### Import data

mod <- lm(mpg ~ wt, data = d)  # fit the model - this means "calculate the parameters"
### show me the coefficients
print(mod)
Call:
lm(formula = mpg ~ wt, data = d)
Coefficients:
(Intercept) wt
37.285 -5.344
This object “mod” that we saved the lm() output to is a list object in R that contains a lot of additional information. You can see this information with names().
## resulting model object is a list that contains within it additional information
names(mod)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
Looking at this list, you see the mod object includes the coefficients.
mod$coefficients
(Intercept) wt
37.285126 -5.344472
And the residuals, which are labeled here.
mod$residuals
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
-2.2826106 -0.9197704 -2.0859521 1.2973499
Hornet Sportabout Valiant Duster 360 Merc 240D
-0.2001440 -0.6932545 -3.9053627 4.1637381
Merc 230 Merc 280 Merc 280C Merc 450SE
2.3499593 0.2998560 -1.1001440 0.8668731
Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
-0.0502472 -1.8830236 1.1733496 2.1032876
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
5.9810744 6.8727113 1.7461954 6.4219792
Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
-2.6110037 -2.9725862 -3.7268663 -3.4623553
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
2.4643670 0.3564263 0.1520430 1.2010593
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
-4.5431513 -2.7809399 -3.2053627 -1.0274952
And the predicted values, which are called “fitted” values here.
mod$fitted.values
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
23.282611 21.919770 24.885952 20.102650
Hornet Sportabout Valiant Duster 360 Merc 240D
18.900144 18.793255 18.205363 20.236262
Merc 230 Merc 280 Merc 280C Merc 450SE
20.450041 18.900144 18.900144 15.533127
Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
17.350247 17.083024 9.226650 8.296712
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
8.718926 25.527289 28.653805 27.478021
Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
24.111004 18.472586 18.926866 16.762355
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
16.735633 26.943574 25.847957 29.198941
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
20.343151 22.480940 18.205363 22.427495
From this, we could calculate the sum of squared residuals.
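For example, either of these one-liners (a small sketch using the pieces of mod we just looked at) gives the same number:

sum(mod$residuals^2)                # sum of squared residuals
sum((d$mpg - mod$fitted.values)^2)  # same thing: actual minus predicted, squared, summed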
So, now, let’s put this lm() function to work and test your understanding.
Make a sequence of X values and Y values that are correlated. Give them some kind of hypothetical values in real life for two features that are plausibly correlated (drinking sugary drinks and number of cavities, etc.)
Make at least 20 pairs of X and Y corresponding to 20 units of observation (e.g. 20 people who drink different amounts of sugary drinks and have different numbers of cavities).
You can do this however you like in R. There are simple ways and slightly more complicated but faster ways. Choose whatever is easy.
Use lm() to estimate the regression coefficients.
Write the regression equation: ______
Find the predicted values using your regression equation, without $fitted.values. Use only the coefficients from the regression equation and the two vectors X and Y. Put X, Y, and the predicted values into three columns of a data frame.
Without using $residuals, calculate the vector of residuals. Add a column of the residuals to your data frame.
Find the sum of squared residuals.
Now, pick a slope and intercept for a “bad” regression line that doesn’t seem to capture the shape of the underlying data very well.
Intercept: ____
Slope: ____
Make a scatterplot of X and Y. Use abline() to add the “good” regression line using the intercept and slope from lm(). Then, use abline() to add the “bad” regression line. (Note: you are welcome to do this with ggplot() if you prefer).
Calculate the predicted values for your “bad” regression line and add them to your data frame in a column called “bad_predict.”
Calculate a column of “bad” residuals:
Calculate the sum of squared residuals from your “bad” regression line.
How does the sum of squared residuals from the “good” regression line compare to the sum of squared residuals from the “bad” regression line? (I really hope the “bad” one is higher.)
Rhetorical question (no response needed): Do you see how the SSE helps us to pick the “best” line?
1pt Extra Credit:
Make a matrix of plots using ggplot() that shows the line segments from predicted to actual values for your two different regression lines, similar to what I did above.
1pt Extra Credit: If you have good R skills, you should be able to pretty easily spin up 10 different candidate lines for a scatterplot, plot the lines onto the scatterplot, and quickly estimate 10 SSEs. See what you can do!
One might ask - is our regression line any “good”? Does it capture the relationship between X and Y very well? We’ll get to that later.
Do Exercises 5.1-5.4 of TCWD.