Functions Tasks
cooks.distance() Calculates Cook’s distance.
text() Draws text labels at the coordinates given by x and y.
ifelse() Returns conditional outputs based on the test.
loessLine() Draws nonparametric (loess) regression lines. From ‘car’ package.
predict() Makes predictions from regression objects. If no new data is supplied, the data used for the fitting is used.
vif() Calculates variance inflation factors. From ‘car’ package.
bptest() Performs the Breusch-Pagan test. From ‘lmtest’ package.
coeftest() Performs z and t Wald tests of estimated regression coefficients. From ‘lmtest’ package (used here with vcovHC() from ‘sandwich’).

1. Introduction

In this lab, we will get hands-on experience detecting and dealing with (1) outliers, (2) non-linearity, (3) multicollinearity, and (4) heteroskedasticity. The data files for this lab are Week11_multi_hetero.csv (used in lab 8) and Week11_non_linearity.csv (used in lab 9, week 10). See labs 8 and 9 for the definitions of the variables in the data used for this lab.

2. Setting the working directory & reading data

Let’s first install the required packages in R (if you’ve already installed them, you don’t need to do it again). Download ‘Week11_non_linearity.csv’ and ‘Week11_multi_hetero.csv’ from Canvas and store them in a folder of your choice.

# Install required packages
install.packages("GGally")
install.packages("stargazer")
install.packages("car")
install.packages("lmtest")
install.packages("sandwich")
install.packages("interactions")
# Call the packages using library()
library(GGally)
library(stargazer)
library(car)
library(lmtest)
library(sandwich)
library(interactions)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 11\\Lab9_2020") # Use your own pathname
df1 <- read.csv("Week11_non_linearity.csv")
df2 <- read.csv("Week11_multi_hetero.csv")

str(df1)
## 'data.frame':    118 obs. of  17 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ walkscore   : num  0.338 0.451 0.645 0.76 0.707 ...
##  $ dist.cbd    : num  4495 3587 4382 3528 3595 ...
##  $ b2s         : num  0.0548 0.0469 0.1497 0.1375 0.1421 ...
##  $ green       : num  0.436 0.407 0.297 0.332 0.316 ...
##  $ s2s         : num  0.0564 0.0685 0.1112 0.1504 0.1344 ...
##  $ gsv.index   : num  0.0117 -0.1493 0.4972 1.7117 1.1917 ...
##  $ p.age65     : num  0.0785 0.1311 0.0684 0.0789 0.1407 ...
##  $ p.service   : num  0.0492 0.0971 0.1769 0.0829 0.1324 ...
##  $ p.crowd.room: num  0.0051 0 0.00482 0.00696 0 ...
##  $ p.hs        : num  0.01316 0.01008 0.02415 0.00474 0.06532 ...
##  $ p.unemp     : num  0.0178 0.0194 0.0298 0.0172 0.0982 ...
##  $ p.pov       : num  0.0804 0.0919 0.0876 0.0699 0.213 ...
##  $ popden      : num  1894 1393 2125 2521 1961 ...
##  $ mroom       : num  7.3 5.6 6.4 4.9 5.3 4.5 5.1 6.1 5.7 5.8 ...
##  $ mval        : num  617300 548200 470300 433000 239600 ...
##  $ crime.percap: num  0.275 0.513 0.397 0.835 2.272 ...
str(df2)
## 'data.frame':    501 obs. of  16 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TractID    : num  1.31e+10 1.31e+10 1.31e+10 1.31e+10 1.31e+10 ...
##  $ walkscore  : num  18.8 26.2 37.4 31.7 33.4 ...
##  $ cbd_dist   : num  12909 15683 15359 11215 11418 ...
##  $ in.atl     : chr  "outATL" "outATL" "outATL" "inATL" ...
##  $ tract      : chr  "Census Tract 402.02" "Census Tract 402.03" "Census Tract 402.04" "Census Tract 403.02" ...
##  $ county     : chr  "Clayton County" "Clayton County" "Clayton County" "Clayton County" ...
##  $ state      : chr  "Georgia" "Georgia" "Georgia" "Georgia" ...
##  $ mval       : num  98600 98200 105000 55400 59300 69100 60500 89700 99200 88100 ...
##  $ hinc       : int  31524 36786 39194 33190 37236 37064 25159 45768 37224 42280 ...
##  $ nroom      : num  5.6 5.2 4.5 4.9 5.4 6.2 4.7 5.5 5.6 6.1 ...
##  $ unit.single: int  614 992 843 1447 2047 1659 1253 1315 1442 1880 ...
##  $ unit.multi : int  397 704 1210 876 629 86 752 98 1408 221 ...
##  $ unitstr.tot: int  1011 1696 2064 2323 2730 1772 2038 1465 2891 2129 ...
##  $ p.single   : num  0.607 0.585 0.408 0.623 0.75 ...
##  $ p.multi    : num  0.393 0.415 0.586 0.377 0.23 ...

 

3. Outliers

Problem & Detection

The presence of one or more outliers can sometimes bias the regression results substantially. The impact of each observation on the regression model can be measured using what is called Cook’s distance.

The abline() function adds a threshold line. Note that some literature recommends against using a numerical threshold and advises looking for “values of D that are substantially larger than the rest” (for more information, see Fox, John. (1991). Regression Diagnostics: An Introduction. Sage Publications.). See ?cooks.distance for help on the function.

# plot cook's distance
outlier.reg <- lm(mval ~ I(p.hs*100) + dist.cbd + mroom, data = df1)
summary(outlier.reg)
## 
## Call:
## lm(formula = mval ~ I(p.hs * 100) + dist.cbd + mroom, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -354025  -97872  -20653   71217  534635 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    88171.186  67510.658   1.306    0.194    
## I(p.hs * 100) -15739.849   1730.344  -9.096 3.52e-15 ***
## dist.cbd          -4.631      5.447  -0.850    0.397    
## mroom          67963.505  14248.303   4.770 5.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153300 on 114 degrees of freedom
## Multiple R-squared:  0.5073, Adjusted R-squared:  0.4943 
## F-statistic: 39.13 on 3 and 114 DF,  p-value: < 2.2e-16
# Calculate Cook's distance
cook.dist <- cooks.distance(outlier.reg)

# Visualize cook.dist
plot(cook.dist, pch="*", cex=2, main="Influential Obs by Cooks distance")

abline(h = 4*mean(cook.dist, na.rm=T), col="red") # add threshold line. 

text(x = 1:length(cook.dist) + 5, # length() portion is added to shift the red texts to the right
     y = cook.dist, 
     col="red",
     labels = ifelse(cook.dist > 4*mean(cook.dist, na.rm=T), # Conditional statement
                     names(cook.dist), # What to do with data points that meet the condition
                     "")) # What to do with data points that do not meet the condition 

Solution

One way to deal with the outlier issue is to log-transform the dependent variable. As shown in the image below, the log transformation pulls data points with large values relatively closer together while spreading out data points with smaller values. After the transformation, the few points with very large values are closer to each other as well as to the other points as a whole. Note that the log transformation is not a cure-all, as it changes the underlying relationship in the data.

The plot below shows the Cook’s distance after the log-transformation.

outlier.reg.log <- lm(log(mval) ~ I(p.hs*100) + dist.cbd + mroom, data = df1)

# Calculate Cook's distance
cook.dist.log <- cooks.distance(outlier.reg.log)

# Visualize cook.dist
plot(cook.dist.log, pch="*", cex=2, main="Influential Obs by Cooks distance")

abline(h = 4*mean(cook.dist.log, na.rm=T), col="red") # add cutoff line

text(x = 1:length(cook.dist.log) + 5, # length() portion is added to shift the red labels to the right
     y = cook.dist.log, 
     col="red",
     labels = ifelse(cook.dist.log > 4*mean(cook.dist.log, na.rm=T), # Conditional statement; the threshold uses cook.dist.log to match the abline above
                     names(cook.dist.log), # Label data points that meet the condition
                     "")) # Leave data points that do not meet the condition unlabeled

4. Non-linearity

Problem & Detection

Non-linear relationships are commonly seen in planning data sets. One example is the relationship between the age of trees and their size (e.g., as measured by the size of the crown, the diameter of the trunk, etc.). Younger trees tend to grow fast, and growth slows down as they get older. If you plot this relationship, you will see a curve with a steep upward slope when trees are young that slowly plateaus as trees get older. If a straight line is fitted to this type of data, the residuals may show undesirable patterns.

Shown below is a scatterplot of % under high school graduation against median home value. Notice that the data points form a convex shape. The orange line shows the regression line. The blue line shows a smooth (loess) curve that follows the data points more closely.

# Using df1 
# Original condition before the log-transformation
plot(df1$p.hs, df1$mval,
     xlab = "% Under high school graduation",
     ylab = "Median Home Value") # scatterplot 

loessLine(df1$p.hs, df1$mval, col = "blue") # smooth line

abline(lm(mval ~ p.hs, data = df1), col = "orange")  # Regression line

Solution

Non-linear relationships need to be linearized before proper regression parameters can be estimated. As presented in the lecture, log transformations and polynomial terms (quadratic/cubic forms of the equation) are some of the options you can choose for linearization. Note that interactions are another way to handle non-linear relationships; we will look at them in Section 7.

Log-Transformation: The two models below (before.log and after.log) mirror outlier.reg and outlier.reg.log from the outlier section above, except that here only I(p.hs*100) is kept as the predictor. Also note that taking the log will cause problems if the variable you are transforming contains negative or zero values. If the variable you want to log-transform contains zero(s), you can add a very small number to it (e.g., 0.01) to avoid taking the log of zero (try log(0) to see what happens).
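A quick illustration of why zeros break the log transformation (a minimal sketch using made-up values, not the lab data):

log(0)                 # returns -Inf, which lm() cannot use
x <- c(0, 0.05, 0.20)  # hypothetical variable that contains a zero
log(x + 0.01)          # adding a small constant avoids -Inf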

In the stargazer output below, you can see two major changes. First, the size of the coefficient changes dramatically. This reflects the fact that log-transforming mval changes the interpretation of the coefficient: a one-unit increase in I(p.hs*100) (i.e., one percentage point) multiplies y by a factor of exp(-0.068) ≈ 0.93, i.e., roughly a 7% decrease. Second, the R2 and Adjusted R2 increase substantially, indicating a much improved model fit (the F-statistic is larger too).

# Comparing regression results with and without log-transformation
before.log <- lm(mval ~ I(p.hs*100), data = df1) # Before log
after.log <- lm(log(mval) ~ I(p.hs*100), data = df1) # After log
stargazer(before.log, after.log, 
          type = "text",
          column.labels = c("Normal Reg", "Logged y variable"))
## 
## ===============================================================
##                                      Dependent variable:       
##                                --------------------------------
##                                     mval          log(mval)    
##                                  Normal Reg   Logged y variable
##                                     (1)              (2)       
## ---------------------------------------------------------------
## I(p.hs * 100)                  -16,295.730***     -0.068***    
##                                 (1,875.076)        (0.006)     
##                                                                
## Constant                       415,075.100***     12.809***    
##                                 (23,646.000)       (0.077)     
##                                                                
## ---------------------------------------------------------------
## Observations                        118              118       
## R2                                 0.394            0.516      
## Adjusted R2                        0.389            0.512      
## Residual Std. Error (df = 116)  168,448.100         0.546      
## F Statistic (df = 1; 116)        75.528***       123.531***    
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01

How to read output tables from stargazer()

The stargazer output gives two numbers for each independent variable, one outside the parentheses and the other inside the parentheses. For example, the statistics for p.hs in the second column of the stargazer output are -0.068 and 0.006. Here, -0.068 is the regression coefficient (i.e., Estimate in R’s regression output) and 0.006 is the standard error of the coefficient estimate (i.e., Std. Error in R’s regression output). Although stargazer does not report the t-values of the coefficient estimates, they are easy to calculate if you want to see them: simply divide -0.068 by 0.006. That gives you the t-value for p.hs in the after.log model (if you see a slight difference between your calculation and summary(after.log), that is because of rounding).
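A minimal sketch of both approaches (the hard-coded numbers below are the rounded figures from the table above):

# t-value by hand from the rounded stargazer figures
-0.068 / 0.006
# Or pull the exact estimates, standard errors, and t-values from the model object
coef(summary(after.log))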

# Plotting the relationship after the log-transformation
plot(df1$p.hs, log(df1$mval),
     xlab = "% Under Poverty",
     ylab = "(Log) Median Home Value",
     main = "Log(DV) vs. IV") # scatterplot 
loessLine(df1$p.hs, log(df1$mval), col = "blue")   # smooth line
abline(lm(log(mval) ~ p.hs, data = df1), col = "orange") # Regression line  (LOGGED)

 

Polynomial terms: Another transformation option is to add squared or cubic terms of an independent variable to the equation. This turns the straight regression line into a curve: a quadratic term allows one bend (hump), and a cubic term allows two. Whether to include polynomial terms can be determined based on (1) theory (as always, I would stress this as the most important criterion) and (2) visual inspection of your data.

The comparison of the three models below shows the Adjusted R2 increasing as quadratic and cubic terms are added to the model.

# Comparing regression models
normal.reg <- lm(mval ~ p.hs, 
                 data = df1) # Before a squared term

quad.reg <- lm(mval ~ p.hs + I(p.hs^2), 
               data = df1) # After the squared term

cubic.reg <- lm(mval ~ p.hs + I(p.hs^2) + I(p.hs^3), 
                data = df1) # After the cubic term

stargazer(normal.reg, quad.reg, cubic.reg, 
          type = "text",
          column.labels = c("Normal Reg", "Quadratic", "Cubic"))
## 
## ===========================================================================================
##                                               Dependent variable:                          
##                     -----------------------------------------------------------------------
##                                                      mval                                  
##                           Normal Reg               Quadratic                 Cubic         
##                               (1)                     (2)                     (3)          
## -------------------------------------------------------------------------------------------
## p.hs                   -1,629,573.000***       -4,407,921.000***       -7,270,817.000***   
##                          (187,507.600)           (507,393.800)          (1,077,260.000)    
##                                                                                            
## I(p.hs2)                                       11,031,010.000***       36,848,233.000***   
##                                                 (1,904,094.000)         (8,841,774.000)    
##                                                                                            
## I(p.hs3)                                                              -57,832,746.000***   
##                                                                        (19,371,843.000)    
##                                                                                            
## Constant                415,075.100***          504,157.000***          553,325.500***     
##                          (23,646.000)            (25,942.670)            (30,015.760)      
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations                  118                     118                     118          
## R2                           0.394                   0.531                   0.565         
## Adjusted R2                  0.389                   0.523                   0.554         
## Residual Std. Error 168,448.100 (df = 116)  148,847.200 (df = 115)  143,976.400 (df = 114) 
## F Statistic         75.528*** (df = 1; 116) 65.146*** (df = 2; 115) 49.390*** (df = 3; 114)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01

Plotting these curved fitted lines is a bit tricky because you need to generate the predictions yourself in order to plot them. I highly encourage you to put effort into making sense of the code, but you are not required to do so.

Notice how the fitted lines curve and how they fit the data better than the straight regression line (orange line).

# Plotting the polynomial curve
prd <- data.frame(p.hs = sort(df1$p.hs)) # Create a new data.frame with one variable, a sorted p.hs 
prd$predict.norm <- predict(normal.reg, newdata = prd) # Make prediction of the model without polynomial terms using the new data.frame and attach the prediction back to the new data

prd$predict.quad <- predict(quad.reg, newdata = prd) # Do the same as the line above except it uses quad.reg
prd$predict.cubic <- predict(cubic.reg, newdata = prd) # Do the same as the line above except it uses cubic.reg

# Plot it using ggplot package
ggplot(data = prd) + 
  geom_line(aes(x = p.hs, y = predict.norm), color = "orange", size = 0.5, linetype = "dashed") +
  geom_line(aes(x = p.hs, y = predict.quad), color = "grey", size = 2) +
  geom_line(aes(x = p.hs, y = predict.cubic), color = "black", size = 2, linetype = "dashed") +
  geom_point(data = df1, aes(x = p.hs, y = mval), col = "dark grey") +
  theme_bw() +
  labs(y = "Median Home Value",
       x = "% Less Than Highschool Education (Universe: Age25+)",
       title = "Median Home Value vs. % Less than High School \n (Grey = quadratic, Black = cubic)")

 

5. Multicollinearity

Problem

Multicollinearity arises when two or more independent variables in a multiple regression model are highly related. This makes it difficult for the regression model to tease out the unique effect of each of the correlated independent variables. In the presence of multicollinearity, the model effectively uses a reduced amount of independent information for estimation, which, in turn, increases the variance of the estimates (i.e., the standard errors of the coefficient estimates). This inflates the p-values and makes it harder to reject the null hypothesis than it should be.

Let’s look at the relationship between variables that we will use as an example. Notice that p.single and p.multi are almost perfectly correlated.

# Starting to use df2 from here
# ggpairs
ggpairs(df2[ ,c("mval", "p.single", "p.multi")])

 

Detection

The Variance Inflation Factor (VIF) is perhaps the most widely used method for detecting multicollinearity. There is a function in the “car” package that will do it all for you with one line of code. See below for an example.

# ols regression using IVs that have a severe multicollinearity problem
multicol.reg <- lm(mval ~ p.single + p.multi, data = df2)

# Calculate VIF
car::vif(multicol.reg) 
## p.single  p.multi 
## 123.0338 123.0338
# Using :: lets you call a function from a package without attaching the package. 'car::vif' means you want to use the vif function in the 'car' package. If you specify package names in this way, you don't need to load the package with library() (but you will need to put the 'car::' prefix in front of every function call).
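To see what vif() is computing, you can reproduce the number by hand: a predictor’s VIF equals 1/(1 − R²) from the auxiliary regression of that predictor on the other predictor(s). A minimal sketch:

# Auxiliary regression of p.single on the other independent variable
aux.r2 <- summary(lm(p.single ~ p.multi, data = df2))$r.squared
1 / (1 - aux.r2)   # should match the vif() output above (~123)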

The correlation between p.single and p.multi was -0.996. What would happen if we had perfect collinearity (i.e., a correlation coefficient of -1 or 1)? First, let’s create a variable by subtracting p.single from 1 and call it p.non.single. Then, let’s include both p.single and p.non.single in a regression model. See below for a demonstration.

As shown below, the correlation coefficient is -1, yielding a perfectly straight line. This creates a perfect (multi)collinearity problem. Because there is no non-overlapping information between p.single and p.non.single, the regression model cannot determine the individual effect of p.single or p.non.single. As a result, the model drops one of the variables and gives estimates for only one of the two perfectly collinear variables. Also note that, in the presence of perfect multicollinearity, the vif() function in the ‘car’ package will fail.

# Artificially creating a perfectly collinear variables
df2$p.non.single <- 1 - df2$p.single

ggpairs(df2[,c("p.non.single", "p.single")]) # scatterplot with a perfect straight line & correlation coefficient of -1

# Let's see what happens in a regression
ols2 <- lm(mval ~ in.atl + p.single + p.non.single, data = df2)
summary(ols2) 
## 
## Call:
## lm(formula = mval ~ in.atl + p.single + p.non.single, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221042 -117431  -41790   94932  850426 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    232810      18848  12.352  < 2e-16 ***
## in.atloutATL   -57142      16045  -3.561 0.000404 ***
## p.single        55767      26402   2.112 0.035162 *  
## p.non.single       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160900 on 498 degrees of freedom
## Multiple R-squared:  0.02731,    Adjusted R-squared:  0.0234 
## F-statistic: 6.991 on 2 and 498 DF,  p-value: 0.001013
# Print the model below and compare it with ols2. They should be identical.
compare.ols <- lm(mval ~ in.atl + p.single, data = df2)
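As a quick follow-up to the comment above (a sketch; output not shown here), you can print compare.ols and confirm that vif() indeed fails on the perfectly collinear model:

summary(compare.ols)  # same coefficients as ols2, minus the NA row for p.non.single
# car::vif(ols2)      # not run: errors because ols2 contains an aliased (perfectly collinear) term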

 

Solution

  1. Use a much larger sample size
  2. Live with it, and be careful when interpreting “insignificant” results
  3. Use factor or principal components analysis to create an index (beyond our scope)
  4. Eliminate one or more highly collinear variables (dangerous)

 

6. Heteroskedasticity (sometimes spelled heteroscedasticity)

Problem

Heteroskedasticity means unequal (conditional) variance. In the context of regression, we talk about heteroskedasticity of error terms (i.e., residuals). If the spread of residuals systematically changes (i.e., changes with a pattern), it is a violation of one of the OLS assumptions.

Heteroskedasticity can be caused naturally. For example, “… the variability of expenditures may increase with income. Richer families may spend a similar amount on groceries as poorer people, but some rich families will sometimes buy expensive items such as lobster. The variability of expenditures for rich families is thus quite large. However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much. Heteroskedasticity can also appear when data is clustered; for example, variability of expenditures on food may vary from city to city, but is quite constant within a city” (B. Rodrigues 2018, Accessed Oct 6, 2019. https://www.brodrigues.co/blog/2018-07-08-rob_stderr/).

When the residuals are heteroskedastic, the coefficient estimates are not biased, but the standard errors become unreliable, which affects confidence intervals and hypothesis tests.

# Creating a model with heteroskedastic residuals
hetero.reg1 <- lm(mval ~ walkscore + nroom + I(p.single*100) + in.atl, data = df2) 
summary(hetero.reg1)
## 
## Call:
## lm(formula = mval ~ walkscore + nroom + I(p.single * 100) + in.atl, 
##     data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -318829  -75655  -19480   60077  615234 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -372190.6    41835.4  -8.897  < 2e-16 ***
## walkscore            2581.1      408.9   6.313 6.09e-10 ***
## nroom              140607.5     7381.5  19.049  < 2e-16 ***
## I(p.single * 100)   -4246.2      361.5 -11.746  < 2e-16 ***
## in.atloutATL       -54283.5    13703.8  -3.961 8.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122400 on 496 degrees of freedom
## Multiple R-squared:  0.4393, Adjusted R-squared:  0.4348 
## F-statistic: 97.17 on 4 and 496 DF,  p-value: < 2.2e-16

 

Detection

Again, a residual plot can be used for visual inspection (this is one reason why the importance of residual plots is emphasized in the lecture over and over). Ideally, you want to see no distinctive pattern in the plot (i.e., a completely random scatter of points). In the plot below, we see that the residuals vary more when the predicted values are higher.

# Residual plot
plot(predict(hetero.reg1), 
     hetero.reg1$residuals, 
     xlab = "Predicted values",
     ylab = "Residuals",
     main = "Residuals vs. Predicted")
abline(h = 0, col = 'red', lty = 2)

The Breusch-Pagan test can be used to test for the presence of heteroskedasticity in the residuals of a regression. You can use bptest() in the “lmtest” package (or ncvTest() in the “car” package). Note that Dr. Clayton used ncvTest() in class, but for this lab I am going to use bptest(). The null hypothesis of this test is that the residuals are homoskedastic. In the test output below, we have an extremely small p-value, indicating that we can reject the null hypothesis of homoskedasticity.

# Breusch-Pagan test
bptest(hetero.reg1)
## 
##  studentized Breusch-Pagan test
## 
## data:  hetero.reg1
## BP = 85.433, df = 4, p-value < 2.2e-16

 

Solution

Transformation: One solution is to transform the data, particularly the dependent variable. Because heteroskedasticity can sometimes be caused by other misspecification issues (e.g., non-linearity), the solutions can overlap at times. Also note that a transformation does not guarantee homoskedasticity. See below for a demonstration.

# Visualize error term after log-transforming the y-variable
hetero.reg2 <- lm(log(mval) ~ walkscore + nroom + I(p.single*100) + in.atl, data = df2) # Notice the log-transformation of DV
summary(hetero.reg2)
## 
## Call:
## lm(formula = log(mval) ~ walkscore + nroom + I(p.single * 100) + 
##     in.atl, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46353 -0.34475 -0.00622  0.33672  1.61816 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.381334   0.175133  53.567   <2e-16 ***
## walkscore          0.015154   0.001712   8.854   <2e-16 ***
## nroom              0.559174   0.030901  18.096   <2e-16 ***
## I(p.single * 100) -0.015906   0.001513 -10.511   <2e-16 ***
## in.atloutATL      -0.009509   0.057367  -0.166    0.868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5125 on 496 degrees of freedom
## Multiple R-squared:  0.4137, Adjusted R-squared:  0.4089 
## F-statistic: 87.49 on 4 and 496 DF,  p-value: < 2.2e-16
# Residual plot
plot(predict(hetero.reg2), 
     hetero.reg2$residuals, 
     xlab = "Predicted values",
     ylab = "Residuals",
     main = "Residuals vs. Predicted")
abline(h = 0, col = 'red', lty = 2)

The log transformation alleviates the problem, but what we see is still not a completely even spread of points across the predicted values. To see whether the heteroskedasticity is completely gone, try bptest(hetero.reg2).
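For reference, the call is the same as before, just with the logged model (output not reproduced here):

# Breusch-Pagan test on the logged model
bptest(hetero.reg2)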

Heteroskedasticity-robust standard errors: Another option is to use heteroskedasticity-robust standard errors. The following code relies on two packages, ‘lmtest’ and ‘sandwich’. Comparing the robust estimates with the original estimates shows that the Estimates do not change; it is the Std. Errors (and therefore the t-values and p-values) that change. You can see that the Std. Errors are generally higher in the robust estimates. You don’t need to memorize the code – you know where to find it.

coeftest(hetero.reg1, vcov = vcovHC(hetero.reg1, type = "HC1"))
## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)       -372190.63   39888.03 -9.3309 < 2.2e-16 ***
## walkscore            2581.14     458.87  5.6250 3.104e-08 ***
## nroom              140607.47    8737.81 16.0918 < 2.2e-16 ***
## I(p.single * 100)   -4246.23     452.55 -9.3828 < 2.2e-16 ***
## in.atloutATL       -54283.46   18859.05 -2.8784   0.00417 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

7. Interaction terms in Regression

Interaction terms enter the regression equation as the product of two constitutive terms, \(x_1\) and \(x_2\). For this product term \(x_1x_2\), the regression equation adds a separate coefficient \(\beta_3\).

\(y = \alpha + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \epsilon\)

Example 1: one binary, one continuous term

In thinking about interaction terms, it helps to first simplify by working through the prediction of the regression equation for different values of two predictors, \(x_1\) and \(x_2\). We can imagine a continuous outcome \(y\), e.g. the income of Hollywood actors, that we predict with two variables. The first, \(x_1\), is a binary variable such as female gender; it takes on values of 0 (males) and 1 (females) only. The second, \(x_2\), is a continuous variable, ranging from −5 to +5, such as a (centered and standardized) measure of age. (I’m using the term “effect” loosely and non-causally here.) An interaction term expresses the idea that the effect of one variable depends on the value of the other variable. With these variables, this suggests that the effect of age on actors’ income is different for male actors than for female actors.

\(\beta_1\) is the effect of \(x_1\) on \(y\) when \(x_2\) is \(0\)

\(\hat{y}=\alpha+\beta_1x_1+\beta_2*0+\beta_3x_1*0\)

\(\hat{y}=\alpha+\beta_1x_1 + 0 + 0\)

\(\beta_2\) is the effect of \(x_2\) on \(y\) when \(x_1\) is \(0\)

\(\hat{y}=\alpha+\beta_1*0+\beta_2x_2+\beta_3*0*x_2\)

\(\hat{y}=\alpha+\beta_2x_2 + 0 + 0\)

When both \(x_1\) and \(x_2\) are not \(0\), \(\beta_3\) becomes important, and the effect of \(x_1\) now varies with the value of \(x_2\). We can plug in 1 for \(x_1\) and simplify the equation as follows:

\(\hat{y}=\alpha+\beta_1 * 1 + \beta_2x_2+\beta_3 * 1 * x_2\)

\(\hat{y}=\alpha + \beta_1 + \beta_2x_2 + \beta_3x_2\)

\(\hat{y} = (\alpha + \beta_1) + (\beta_2 + \beta_3) * x_2\)

 

With simulated data, this can be illustrated easily. I begin by simulating a dataset with 200 observations, two predictors \(x_1\) (binary: male/female) and \(x_2\) (continuous: standardized and centered age), and create \(y\) (continuous: income) as the linear combination of \(x_1\), \(x_2\), and an interaction term of the two predictors.

set.seed(123)
n.sample <- 200
x1 <- rbinom(n.sample, size = 1, prob =0.5)
x2 <- runif(n.sample, -5, 5)
a <- 5
b1 <- 3
b2 <- 4
b3 <- -3
e <- rnorm(n.sample, 0, 5)
y <- a + b1 * x1 + b2 * x2 + b3 * x1 * x2 + e
sim.dat <- data.frame(y,x1,x2)

 

Before advancing, this is what the simulated data look like:

par(mfrow = c(2,2)) # Create a 2 x 2 plotting matrix
# The next 3 plots created will be drawn in the same window
hist(sim.dat$x1)
hist(sim.dat$x2)
hist(sim.dat$y)

For convenience, you could also use the multi.hist() function from the “psych” package. It automatically adds a density curve (dashed line) and a normal density plot (dotted line).
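A minimal sketch of that alternative (assuming the ‘psych’ package is installed; it is not otherwise used in this lab):

# install.packages("psych")  # if not yet installed
library(psych)
multi.hist(sim.dat)  # histograms of y, x1, and x2 with density and normal curves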

 

Fitting a model using what we know about the data-generating process gives us:

mod.sim <- lm(y ~ x1 * x2, data = sim.dat)
summary(mod.sim)
## 
## Call:
## lm(formula = y ~ x1 * x2, data = sim.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1974  -3.4874  -0.0738   3.4837  13.2178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.7891     0.4924   9.726  < 2e-16 ***
## x1            3.8716     0.7081   5.467 1.38e-07 ***
## x2            3.9554     0.1663  23.790  < 2e-16 ***
## x1:x2        -2.9435     0.2421 -12.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.997 on 196 degrees of freedom
## Multiple R-squared:  0.7615, Adjusted R-squared:  0.7579 
## F-statistic: 208.6 on 3 and 196 DF,  p-value: < 2.2e-16
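To connect these estimates back to the algebra in the previous section: when \(x_1 = 1\), the slope of \(x_2\) is \(\beta_2 + \beta_3\) and the intercept is \(\alpha + \beta_1\). A quick check with the fitted model:

# Slope of x2 for the x1 = 1 group: beta2 + beta3 (close to the true 4 + (-3) = 1)
coef(mod.sim)["x2"] + coef(mod.sim)["x1:x2"]
# Intercept for the x1 = 1 group: alpha + beta1 (close to the true 5 + 3 = 8)
coef(mod.sim)["(Intercept)"] + coef(mod.sim)["x1"]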

Interaction effects: x1 and x2

That is, for all cases where \(x_1=1\), what is the relationship between \(x_2\) and \(y\)?

interact_plot(mod.sim, pred = x2, modx = x1)

From this first exercise, you should take home a few things:

Specifying an interaction

Although the tutorial does not discuss this at length, interaction terms should be derived from theory, especially in the prevalent case of using null hypothesis significance tests on interaction effects.

In R, you specify an interaction term by putting an asterisk between the two constitutive terms: x1 * x2

You always need to include the constitutive terms in your model.

See Brambor et al. (2006) for more details. What would it imply if your model only included \(\beta_3x_1x_2\)? Omitting the constitutive terms is equivalent to forcing \(\beta_1 = 0\) and \(\beta_2 = 0\). See for yourself:

mod.sim2 <- lm(y ~ x1:x2, data = sim.dat)
summary(mod.sim2)
## 
## Call:
## lm(formula = y ~ x1:x2, data = sim.dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.6719  -4.5727   0.6632   5.8826  24.6120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.6555     0.7078   9.404  < 2e-16 ***
## x1:x2         0.9588     0.3513   2.729  0.00692 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.995 on 198 degrees of freedom
## Multiple R-squared:  0.03625,    Adjusted R-squared:  0.03138 
## F-statistic: 7.448 on 1 and 198 DF,  p-value: 0.006924
interact_plot(mod.sim2, pred = x2, modx = x1)

Note: R will automatically include the constitutive terms when you use the asterisk as above.

y ~ x1 + x2 + x1 * x2 is equivalent to y ~ x1 * x2
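A quick sketch to verify this equivalence with the simulated data from above:

# Both formulas expand to the same design matrix, so the coefficients are identical
coef(lm(y ~ x1 + x2 + x1 * x2, data = sim.dat))
coef(lm(y ~ x1 * x2, data = sim.dat))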

source: Karreth’s tutorial (interaction terms) http://www.jkarreth.net/files/RPOS517_Day11_Interact.html