Regressions and Interactions

library(socsci)
library(jtools)
library(interactions)
df <- read_csv("https://raw.githubusercontent.com/ryanburge/pls2003_sp17/master/living.csv")

Take a look at our data

df

## # A tibble: 3,617 x 9
##       X1 weight height  male  educ physical stamps white black
##    <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>  <dbl> <dbl> <dbl>
##  1     1    208     68     0     3        1      0     0     1
##  2     2    250     73     1    12        3      0     1     0
##  3     3    165     69     1     9        5      0     1     0
##  4     4    140     67     1    10        3      0     1     0
##  5     5    100     61     0    14        2      0     1     0
##  6     6    160     67     1     7        2      0     1     0
##  7     7    213     63     0    12        3      1     0     1
##  8     8    220     65     0     9        1      0     0     1
##  9     9    130     66     0    12        4      0     0     1
## 10    10    130     68     1     9        5      0     1     0
## # ... with 3,607 more rows

This is just a basic dataset of heights and weights of a bunch of individuals alongside a number of other variables including race, gender, education, whether they are on foodstamps, and their level of physical activity

A very simple regression

We want to predict weight and we want to use height to do it, here’s the syntax:

lm(weight ~ height, data = df)

## 
## Call:
## lm(formula = weight ~ height, data = df)
## 
## Coefficients:
## (Intercept)       height  
##    -135.440        4.499

Okay, this looks complicated and not very helpful - let’s make it easier to read and understand by using the summ function from the jtools package.

reg <- lm(weight ~ height, data = df)
summ(reg)

## MODEL INFO:
## Observations: 3538 (79 missing obs. deleted)
## Dependent Variable: weight
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(1,3536) = 1168.66, p = 0.00
## R² = 0.25
## Adj. R² = 0.25 
## 
## Standard errors: OLS
## --------------------------------------------------
##                        Est.   S.E.   t val.      p
## ----------------- --------- ------ -------- ------
## (Intercept)         -135.44   8.71   -15.54   0.00
## height                 4.50   0.13    34.19   0.00
## --------------------------------------------------

We can see things like the number of observations here: 3538, but we have a few missing observations. It tells us that we have weight as our dependent variable and it tells us that we have an OLS regression. All fine. We also see that the Adjusted R2 (which is a measurement of how well our model fits) is .25, which isn’t great. But, we can focus on that later.

The important things here are our coefficient estimates. As you can see, for the height variable, that estimate is 4.5. Which means for every inch taller someone is, their weight should increase by 4.5 pounds.

Using coefficient estimates to predict things

Let’s say that someone is 64 inches tall. How much should they weigh based on our model? We can do that like so:

64*4.5

## [1] 288

The result there is 288 pounds. That seems high doesn’t it? Well, it’s because we aren’t done yet. We still need to subract the y-intercept (135.44 pounds)

288 - 135.44

## [1] 152.56

So, our model predicts that this 64 inch tall person would weight about 152.5 pounds. But, don’t more factors determine our weight than just our height? Of course they do.

Adding a control variable

Let’s take the prior equation and just add a control variable for white respondents. We do that like so:

reg <- lm(weight ~ height + white, data = df)
summ(reg)

## MODEL INFO:
## Observations: 3538 (79 missing obs. deleted)
## Dependent Variable: weight
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(2,3535) = 638.34, p = 0.00
## R² = 0.27
## Adj. R² = 0.26 
## 
## Standard errors: OLS
## --------------------------------------------------
##                        Est.   S.E.   t val.      p
## ----------------- --------- ------ -------- ------
## (Intercept)         -133.50   8.62   -15.49   0.00
## height                 4.56   0.13    34.99   0.00
## white                 -9.60   1.06    -9.02   0.00
## --------------------------------------------------

First, notice that our R squared got just a bit better. So, we are moving in the right direction. But how we do use our new model to predict weight?

Let’s say that our person is a 70 inch white person. How much should they weigh?

4.56*70

## [1] 319.2

The answer is 319.2, but remember that this person is white. So the coefficient for white is -9.6. So we need to subtract that.

319.2-9.6

## [1] 309.6

And then we take 309.6 and subtract our y-intercept

309.6-133.50

## [1] 176.1

This hypothetical person should weight 176.1 pounds according to our model.

But, what if our person wasn’t white? Well that value is 0 in our dataset and 0*-9.6 is zero. So we wouldn’t add or subtract anything at that step. It wouldn’t change our overall coefficient.

Visualizing Regression Output

It’s always better to visualize regression coefficients. Let’s do that with the plot_coefs function.

plot_coefs(reg)

If the point estimate is to the right of zero and it (or the horizontal blue line) doesn’t intercept with zero, then this variable increases weight. If it’s to the left of zero and doesn’t intersect with zero, then it decreases weight. We can see that height increases weight, while being white decreases weight.

Statistical Significance

But, sometimes variables don’t have a clear impact on a dependent variable. Here’s an example.

reg <- lm(weight ~ height + white + black, data = df)
summ(reg)

## MODEL INFO:
## Observations: 3538 (79 missing obs. deleted)
## Dependent Variable: weight
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(3,3534) = 431.52, p = 0.00
## R² = 0.27
## Adj. R² = 0.27 
## 
## Standard errors: OLS
## --------------------------------------------------
##                        Est.   S.E.   t val.      p
## ----------------- --------- ------ -------- ------
## (Intercept)         -138.33   8.71   -15.89   0.00
## height                 4.53   0.13    34.79   0.00
## white                 -3.01   2.09    -1.44   0.15
## black                  7.99   2.18     3.66   0.00
## --------------------------------------------------

Notice the last column on the right here. That’s the p-value. If the p-value is .05 or less, then your coefficient is statistically significant. If the p-value is great that .05 than our variable is not statistically significant. But, what does that actually mean? Visualizing it will help make more sense of this concept.

plot_coefs(reg)

Notice how the blue line for the white variable intersects with the vertical dashed line that’s on zero? What that means is that the point estimate for that coefficient could be anywhere along that blue line. Which means that being white could mean lower weight or it could mean higher weight. And, if we don’t know which one it could be then we don’t have statistical significance. In other words: in this model we can’t say whether being white makes you more or less likely to weigh more.

A More Complex Regression Model

Let’s add a couple more variables to our model and see what generates a higher weight estimate and which one generates a lower weight estimate.

reg <- lm(weight ~ height + white + black + educ + stamps + physical + male, data = df)
plot_coefs(reg)

We can see that increasing levels of education lead to lower weights as do higher levels of physical activity. We see that increased height, being male, being black, and being on food stamps also increases weight.

summ(reg)

## MODEL INFO:
## Observations: 3538 (79 missing obs. deleted)
## Dependent Variable: weight
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(7,3530) = 197.00, p = 0.00
## R² = 0.28
## Adj. R² = 0.28 
## 
## Standard errors: OLS
## ---------------------------------------------------
##                        Est.    S.E.   t val.      p
## ----------------- --------- ------- -------- ------
## (Intercept)         -112.15   11.45    -9.80   0.00
## height                 4.26    0.18    23.42   0.00
## white                 -0.30    2.11    -0.14   0.89
## black                  8.45    2.18     3.88   0.00
## educ                  -0.70    0.16    -4.32   0.00
## stamps                 4.71    2.16     2.18   0.03
## physical              -1.44    0.37    -3.86   0.00
## male                   5.07    1.49     3.41   0.00
## ---------------------------------------------------

In practical terms we can say this: holding height, race, gender, food stamps, and level of physical activity constant, a one grade level increase in education translates to an individual weighing .7 fewer pounds.

Interactions

One of the newest ways to use regression is through an interaction. Let’s say we wanted to see how height impacts weight for white people vs people of color. We can easily do that in a simple interaction.

int <- lm(weight ~ height*white, data = df)
summ(int)

## MODEL INFO:
## Observations: 3538 (79 missing obs. deleted)
## Dependent Variable: weight
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(3,3534) = 433.61, p = 0.00
## R² = 0.27
## Adj. R² = 0.27 
## 
## Standard errors: OLS
## ---------------------------------------------------
##                        Est.    S.E.   t val.      p
## ------------------ -------- ------- -------- ------
## (Intercept)          -87.25   13.89    -6.28   0.00
## height                 3.86    0.21    18.32   0.00
## white                -84.54   17.70    -4.78   0.00
## height:white           1.14    0.27     4.24   0.00
## ---------------------------------------------------

But, please don’t try to interpret an interaction through a coefficient table. It’s really hard to do that. Let’s visualize it.

interact_plot(int, pred = height, modx = white, interval = TRUE)

We can see that for shorter white people, they weigh less than shorter people of color. However, as heights increase the two lines begin to converge and once height gets above 70 inches there’s no statistical difference between white and non-white respondents.

Three Way Interactions

Let’s say we wanted to look at the same interaction, but add a gender variable. We can do that.

int <- lm(weight ~ height*white*male, data = df)
interact_plot(int, pred = height, modx = white, mod2 = male, interval = TRUE)

The left panel is women, the right panel is men. The solid line is white respondents, the dashed line represents people of color.

For men, there’s basically no difference for white vs. non-white. However, for women there is a difference in the middle of the height distribution but not at either end.

Making That Interaction Graph Better

All those 1’s and 0’s get hard to decipher sometimes. Let’s make it easier to interpret.

df <- df %>% 
  mutate(white_new = frcode(white == 1 ~ "White",
                            white == 0 ~ "Non-White")) %>% 
  mutate(male_new = frcode(male == 1 ~ "Men",
                       male == 0 ~ "Women"))

int <- lm(weight ~ height*white_new*male_new, data = df)
interact_plot(int, pred = height, modx = white_new, mod2 = male_new, interval = TRUE, mod2.labels = c("Men", "Women"))