Functions Tasks
cooks.distance() Calculates Cook’s distance.
text() Draws text labels at the coordinates given by x and y.
ifelse() Returns conditional outputs based on the test.
loessLine() Draws a nonparametric (loess) regression line on a scatterplot. From ‘car’ package.
predict() Make predictions using regression objects. If no new data is supplied, the data used for the fitting is used.
vif() Calculates variance inflation factors. From ‘car’ package.
bptest() Performs Breusch-Pagan Test. From ‘lmtest’ package.
coeftest() Performs z and t Wald tests of estimated regression coefficients. From ‘lmtest’ package; used here with robust variance estimators from vcovHC() in the ‘sandwich’ package.

1. Introduction

Lab 7 is heavily based on the lecture on Oct 3 (Thu). We will get hands-on experience detecting and dealing with (1) outliers, (2) non-linearity, (3) multicollinearity, and (4) heteroskedasticity.

2. Setting the working directory & reading data

Let’s first install the required packages in R. In addition to the functions in base R, we will use ggpairs() from the “GGally” package, stargazer() from “stargazer”, vif() from “car”, bptest() and coeftest() from “lmtest”, and vcovHC() from “sandwich”. There are two .csv datasets that you can download from Canvas > Files > Lab Materials: ‘non_linearity.csv’ and ‘multi_hetero.csv’. Download the files and store them in a folder of your choice.

# Install required packages
install.packages("GGally")
install.packages("stargazer")
install.packages("car")
install.packages("lmtest")
install.packages("sandwich")
# Call the packages using library()
library(GGally)
library(stargazer)
library(car)
library(lmtest)
library(sandwich)
# Set the working directory -- USE YOUR OWN pathname and file names
setwd("C:/Users/Bonwoo Koo/Dropbox/School/CP6025/Labs/Lab7")
df1 <- read.csv("non_linearity.csv")
df2 <- read.csv("multi_hetero.csv")

str(df1)
## 'data.frame':    118 obs. of  17 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ walkscore   : num  0.338 0.451 0.645 0.76 0.707 ...
##  $ dist.cbd    : num  4495 3587 4382 3528 3595 ...
##  $ b2s         : num  0.0548 0.0469 0.1497 0.1375 0.1421 ...
##  $ green       : num  0.436 0.407 0.297 0.332 0.316 ...
##  $ s2s         : num  0.0564 0.0685 0.1112 0.1504 0.1344 ...
##  $ gsv.index   : num  0.0117 -0.1493 0.4972 1.7117 1.1917 ...
##  $ p.age65     : num  0.0785 0.1311 0.0684 0.0789 0.1407 ...
##  $ p.service   : num  0.0492 0.0971 0.1769 0.0829 0.1324 ...
##  $ p.crowd.room: num  0.0051 0 0.00482 0.00696 0 ...
##  $ p.hs        : num  0.01316 0.01008 0.02415 0.00474 0.06532 ...
##  $ p.unemp     : num  0.0178 0.0194 0.0298 0.0172 0.0982 ...
##  $ p.pov       : num  0.0804 0.0919 0.0876 0.0699 0.213 ...
##  $ popden      : num  1894 1393 2125 2521 1961 ...
##  $ mroom       : num  7.3 5.6 6.4 4.9 5.3 4.5 5.1 6.1 5.7 5.8 ...
##  $ mval        : num  617300 548200 470300 433000 239600 ...
##  $ crime.percap: num  0.275 0.513 0.397 0.835 2.272 ...
str(df2)
## 'data.frame':    501 obs. of  16 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TractID    : num  1.31e+10 1.31e+10 1.31e+10 1.31e+10 1.31e+10 ...
##  $ walkscore  : num  18.8 26.2 37.4 31.7 33.4 ...
##  $ cbd_dist   : num  12909 15683 15359 11215 11418 ...
##  $ in.atl     : Factor w/ 2 levels "inATL","outATL": 2 2 2 1 1 2 1 2 2 2 ...
##  $ tract      : Factor w/ 501 levels "Census Tract 1",..: 381 382 383 384 385 386 387 388 389 390 ...
##  $ county     : Factor w/ 4 levels "Clayton County",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state      : Factor w/ 1 level "Georgia": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mval       : num  98600 98200 105000 55400 59300 69100 60500 89700 99200 88100 ...
##  $ hinc       : int  31524 36786 39194 33190 37236 37064 25159 45768 37224 42280 ...
##  $ nroom      : num  5.6 5.2 4.5 4.9 5.4 6.2 4.7 5.5 5.6 6.1 ...
##  $ unit.single: int  614 992 843 1447 2047 1659 1253 1315 1442 1880 ...
##  $ unit.multi : int  397 704 1210 876 629 86 752 98 1408 221 ...
##  $ unitstr.tot: int  1011 1696 2064 2323 2730 1772 2038 1465 2891 2129 ...
##  $ p.single   : num  0.607 0.585 0.408 0.623 0.75 ...
##  $ p.multi    : num  0.393 0.415 0.586 0.377 0.23 ...

 

3. Outliers

Problem & Detection

The presence of one or more outliers can sometimes bias the regression results substantially. The impact of each observation on the regression model can be measured using what is called Cook’s distance.

# Fit a regression model; we will compute and plot Cook's distance for it below
outlier.reg <- lm(mval ~ I(p.hs*100) + dist.cbd + mroom, data = df1)
summary(outlier.reg)
## 
## Call:
## lm(formula = mval ~ I(p.hs * 100) + dist.cbd + mroom, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -354025  -97872  -20653   71217  534635 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    88171.186  67510.658   1.306    0.194    
## I(p.hs * 100) -15739.849   1730.344  -9.096 3.52e-15 ***
## dist.cbd          -4.631      5.447  -0.850    0.397    
## mroom          67963.505  14248.303   4.770 5.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153300 on 114 degrees of freedom
## Multiple R-squared:  0.5073, Adjusted R-squared:  0.4943 
## F-statistic: 39.13 on 3 and 114 DF,  p-value: < 2.2e-16
# Calculate Cook's distance
cook.dist <- cooks.distance(outlier.reg)

# Visualize cook.dist
plot(cook.dist, pch="*", cex=2, main="Influential Obs by Cook's Distance")

abline(h = 4*mean(cook.dist, na.rm=T), col="red") # add cutoff line

text(x = 1:length(cook.dist) + 5, # the '+ 5' nudges the labels to the right of each point
     y = cook.dist, 
     labels = ifelse(cook.dist > 4*mean(cook.dist, na.rm=T), # condition: is the observation influential?
                     names(cook.dist), # if TRUE, label the point with its observation name
                     ""), # if FALSE, print an empty label
     col="red") 

Solution

One way to deal with the outlier issue is to log-transform the dependent variable. As shown in the image below, the log-transformation compresses the distances among data points with large values and stretches the distances among data points with small values. After the transformation, the few points with very large values sit closer to each other and to the rest of the data.
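
For a quick numerical illustration of this compression (the three values are made up for this example):

diff(c(100, 1000, 10000))      # gaps of 900 and 9000 -- very unequal
diff(log(c(100, 1000, 10000))) # gaps of about 2.3 and 2.3 -- equal after logging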

The plot below shows the Cook’s distance after the log-transformation.

outlier.reg.log <- lm(log(mval) ~ I(p.hs*100) + dist.cbd + mroom, data = df1)

# Calculate Cook's distance
cook.dist.log <- cooks.distance(outlier.reg.log)

# Visualize cook.dist
plot(cook.dist.log, pch="*", cex=2, main="Influential Obs by Cook's Distance")

abline(h = 4*mean(cook.dist.log, na.rm=T), col="red") # add cutoff line

text(x = 1:length(cook.dist.log) + 5, # the '+ 5' nudges the labels to the right of each point
     y = cook.dist.log, 
     labels = ifelse(cook.dist.log > 4*mean(cook.dist.log, na.rm=T), # condition: is the observation influential?
                     names(cook.dist.log), # if TRUE, label the point with its observation name
                     ""), # if FALSE, print an empty label
     col="red") 

4. Non-linearity

Problem & Detection

Non-linear relationships are common in planning data sets. One example from the lecture was the relationship between the age of trees and their size (e.g., as measured by crown size, trunk diameter, etc.). Younger trees tend to grow fast and slow down as they get older. If you plot this relationship, you will see a curve with a steep upward slope at the beginning that slowly plateaus as trees get older. If a straight line is fitted to this type of data, the residuals may show undesirable patterns.

# Using df1 
# Original condition before the log-transformation
plot(df1$p.hs, df1$mval,
     xlab = "% Less than High School",
     ylab = "Median Home Value")                      # scatterplot 
loessLine(df1$p.hs, df1$mval, col = "blue")   # smooth (loess) line from the 'car' package
abline(lm(mval ~ p.hs, data = df1), col = "orange")  # Regression line

Solution

Non-linear relationships need to be linearized before proper regression parameters can be estimated. As presented in the lecture, log-transformation and polynomial terms (quadratic/cubic terms in the equation) are some of the options you can choose for linearization.

Log-Transformation: Notice that the two models below (before.log and after.log) are the same as outlier.reg and outlier.reg.log created above for outliers, except that they include only I(p.hs*100) as the independent variable. Also note that taking the log causes problems if the variable you are transforming contains zero or negative values. If your data contains zeros (but no negative values), you can add a very small constant (e.g., 0.01) before taking the log (try log(0) to see what happens).
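
For example, a quick sketch (the vector x below is hypothetical, made up only to illustrate the shift):

log(0)                # returns -Inf, which breaks lm()
x <- c(0, 0.5, 2, 10) # hypothetical variable containing a zero
log(x + 0.01)         # adding a small constant makes every value loggable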

In the stargazer output below, you can see two major changes. First, the size of the coefficient changes dramatically. This reflects the fact that log-transforming mval changes the interpretation of the coefficient (e.g., a one-unit increase in I(p.hs*100), i.e., a one percentage-point increase in p.hs, changes y by a factor of exp(-0.068)). Second, R2 and Adjusted R2 increase substantially, indicating a much improved model fit (the F-statistic is larger, too).
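
To see what that multiplicative factor means in practice (simple arithmetic, not part of the original lab code):

exp(-0.068) # about 0.934: a one-unit increase in I(p.hs * 100) is associated with roughly a 6.6% lower mval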

# Comparing regression results with and without log-transformation
before.log <- lm(mval ~ I(p.hs*100), data = df1) # Before log
after.log <- lm(log(mval) ~ I(p.hs*100), data = df1) # After log
stargazer(before.log, after.log, 
          type = "text",
          column.labels = c("Normal Reg", "Logged y variable"))
## 
## ===============================================================
##                                      Dependent variable:       
##                                --------------------------------
##                                     mval          log(mval)    
##                                  Normal Reg   Logged y variable
##                                     (1)              (2)       
## ---------------------------------------------------------------
## I(p.hs * 100)                  -16,295.730***     -0.068***    
##                                 (1,875.076)        (0.006)     
##                                                                
## Constant                       415,075.100***     12.809***    
##                                 (23,646.000)       (0.077)     
##                                                                
## ---------------------------------------------------------------
## Observations                        118              118       
## R2                                 0.394            0.516      
## Adjusted R2                        0.389            0.512      
## Residual Std. Error (df = 116)  168,448.100         0.546      
## F Statistic (df = 1; 116)        75.528***       123.531***    
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01

How to read output tables from stargazer()

The stargazer output gives two numbers for each independent variable, one outside the parentheses and one inside. For example, the statistics for I(p.hs * 100) in the second column of the stargazer output are -0.068 and 0.006. Here, -0.068 is the regression coefficient (i.e., Estimate in R’s regression output) and 0.006 is the standard error of the coefficient estimate (i.e., Std. Error in R’s regression output). Although stargazer does not report the t-values of the coefficient estimates, they are easy to calculate: simply divide -0.068 by 0.006. That gives you the t-value for I(p.hs * 100) in the after.log model (if you see a slight difference between your calculation and summary(after.log), it is due to rounding).
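
For instance, the calculation below reproduces that t-value (only approximately, because the numbers reported by stargazer are rounded):

-0.068 / 0.006           # roughly -11.3
coef(summary(after.log)) # the full coefficient table, including the exact t values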

# Plotting the relationship after the log-transformation
plot(df1$p.hs, log(df1$mval),
     xlab = "% Under Poverty",
     ylab = "(Log) Median Home Value",
     main = "Log(DV) vs. IV") # scatterplot 
loessLine(df1$p.hs, log(df1$mval), col = "blue")   # smooth line
abline(lm(log(mval) ~ p.hs, data = df1), col = "orange") # Regression line  (LOGGED)

 

Polynomial terms: Another transformation option is to add squared or cubic terms of an independent variable to the equation. This turns the fitted line into a curve: a quadratic term allows one bend (hump), and a cubic term allows two. Whether to include polynomial terms can be decided based on (1) theory (e.g., the tree age and tree size example), (2) visual inspection of your data, and (3) residual plots.

The comparison of the three models below shows Adjusted R2 increasing as the quadratic and cubic terms are added to the model.

# Comparing regression models
normal.reg <- lm(mval ~ p.hs, 
                 data = df1) # Before a squared term

quad.reg <- lm(mval ~ p.hs + I(p.hs^2), 
               data = df1) # After the squared term

cubic.reg <- lm(mval ~ p.hs + I(p.hs^2) + I(p.hs^3), 
                data = df1) # After the cubic term

stargazer(normal.reg, quad.reg, cubic.reg, 
          type = "text",
          column.labels = c("Normal Reg", "Quadratic", "Cubic"))
## 
## ===========================================================================================
##                                               Dependent variable:                          
##                     -----------------------------------------------------------------------
##                                                      mval                                  
##                           Normal Reg               Quadratic                 Cubic         
##                               (1)                     (2)                     (3)          
## -------------------------------------------------------------------------------------------
## p.hs                   -1,629,573.000***       -4,407,921.000***       -7,270,817.000***   
##                          (187,507.600)           (507,393.800)          (1,077,260.000)    
##                                                                                            
## I(p.hs2)                                       11,031,010.000***       36,848,233.000***   
##                                                 (1,904,094.000)         (8,841,774.000)    
##                                                                                            
## I(p.hs3)                                                              -57,832,746.000***   
##                                                                        (19,371,843.000)    
##                                                                                            
## Constant                415,075.100***          504,157.000***          553,325.500***     
##                          (23,646.000)            (25,942.670)            (30,015.760)      
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations                  118                     118                     118          
## R2                           0.394                   0.531                   0.565         
## Adjusted R2                  0.389                   0.523                   0.554         
## Residual Std. Error 168,448.100 (df = 116)  148,847.200 (df = 115)  143,976.400 (df = 114) 
## F Statistic         75.528*** (df = 1; 116) 65.146*** (df = 2; 115) 49.390*** (df = 3; 114)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01

Plotting these curved fitted lines is a bit tricky because you need to generate the predictions yourself in order to plot them. I highly encourage you to put in the effort to make sense of the code, but it is not required.

Notice how the fitted lines curve and how the curved lines fit the data better than the straight regression line (orange line).

# Plotting the polynomial curve
prd <- data.frame(p.hs = sort(df1$p.hs)) # New data.frame with one variable: p.hs, sorted
prd$predict.norm <- predict(normal.reg, newdata = prd) # Predictions from normal.reg (no polynomial terms), attached back to prd

prd$predict.quad <- predict(quad.reg, newdata = prd) # Same as above, but with quad.reg
prd$predict.cubic <- predict(cubic.reg, newdata = prd) # Same as above, but with cubic.reg

# Plot it using ggplot package
ggplot(data = prd) + 
  geom_line(aes(x = p.hs, y = predict.norm), color = "orange", size = 0.5, linetype = "dashed") +
  geom_line(aes(x = p.hs, y = predict.quad), color = "grey", size = 2) +
  geom_line(aes(x = p.hs, y = predict.cubic), color = "black", size = 2, linetype = "dashed") +
  geom_point(data = df1, aes(x = p.hs, y = mval), col = "dark grey") +
  theme_bw() +
  labs(y = "Median Home Value",
       x = "% Less Than Highschool Education (Demon: Age25+)",
       title = "Median Home Value vs. % Less than High School (Grey = quadratic, Black = cubic)")

 

5. Multicollinearity

Problem

Multicollinearity arises when two or more independent variables in a multiple regression model are highly correlated. This makes it difficult for the regression model to tease apart the unique effects of the correlated independent variables. In the presence of multicollinearity, the model has less independent information to work with, which increases the variance of the estimates (i.e., the standard errors of the coefficients). This, in turn, inflates the p-values and makes it harder to reject the null hypothesis.

Let’s look at the relationship between variables that we will use as an example. Notice that p.single and p.multi are almost perfectly correlated.

# Starting to use df2 from here
# ggpairs
ggpairs(df2[ ,c("mval", "p.single", "p.multi")])

 

Detection

The Variance Inflation Factor (VIF) is perhaps the most widely used diagnostic for multicollinearity. Before we look into how VIFs are calculated, note that the “car” package has a function that does it all for you with one line of code. See below for an example.

# OLS regression using IVs that have a severe multicollinearity problem
multicol.reg <- lm(mval ~ p.single + p.multi, data = df2)

# Calculate VIF
car::vif(multicol.reg) 
## p.single  p.multi 
## 123.0338 123.0338
# Using :: is another way of accessing a package's functions: 'car::vif' means the vif() function from the 'car' package. If you call functions this way, you don't need to attach the package with library(), but you do need the 'car::' prefix on every call.
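
If you are curious what vif() is doing, here is a sketch of the standard VIF formula (regress each predictor on the remaining predictors and compute 1 / (1 - R2)); this mirrors the definition, not the internal code of the ‘car’ package:

# Manual VIF for p.single: R-squared from regressing p.single on the other predictor(s)
r2.single <- summary(lm(p.single ~ p.multi, data = df2))$r.squared
1 / (1 - r2.single) # should match the vif() output for p.single above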

The correlation between p.single and p.multi was -0.996. What would happen if you have perfect collinearity (i.e., correlation coefficient of -1 or 1)? First, let’s create a variable by subtracting p.single from 1 and call it p.non.single. Then, let’s include both p.single and p.non.single into a regression model. See below for a demonstration.

As shown below, the correlation coefficient is -1, yielding a perfectly straight line. This creates perfect (multi)collinearity, which corresponds to perfectly overlapping circles in the Venn diagram (remember the diagram from the slides). Because the regression model has no non-overlapping region between p.single and p.non.single, it cannot determine their individual effects. As a result, the model drops one of the variables and reports estimates for only one of the two perfectly collinear variables. Also note that, in the presence of perfect multicollinearity, the vif() function in the ‘car’ package will fail.

# Artificially creating a perfectly collinear variables
df2$p.non.single <- 1 - df2$p.single

ggpairs(df2[,c("p.non.single", "p.single")]) # scatterplot with a perfect straight line & correlation coefficient of -1

# Let's see what happens in a regression
ols2 <- lm(mval ~ in.atl + p.single + p.non.single, data = df2)
summary(ols2) 
## 
## Call:
## lm(formula = mval ~ in.atl + p.single + p.non.single, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221042 -117431  -41790   94932  850426 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    232810      18848  12.352  < 2e-16 ***
## in.atloutATL   -57142      16045  -3.561 0.000404 ***
## p.single        55767      26402   2.112 0.035162 *  
## p.non.single       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160900 on 498 degrees of freedom
## Multiple R-squared:  0.02731,    Adjusted R-squared:  0.0234 
## F-statistic: 6.991 on 2 and 498 DF,  p-value: 0.001013
# Fit the model below and compare it with ols2. The coefficient estimates should be identical.
compare.ols <- lm(mval ~ in.atl + p.single, data = df2)
summary(compare.ols) # compare with summary(ols2)
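
As noted above, vif() fails when a model contains perfectly collinear (aliased) terms. A quick way to confirm this without stopping your script (the exact error message may differ across ‘car’ versions):

# vif() errors out on ols2 because p.non.single is aliased (perfectly collinear with p.single)
tryCatch(car::vif(ols2), error = function(e) e$message)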

 

Solution

  1. Use a much larger sample size
  2. Live with it, and be careful when interpreting “insignificant” results
  3. Use factor or principal components analysis (beyond our scope)
  4. Eliminate one or more highly collinear variables (dangerous)

 

6. Heteroskedasticity (sometimes spelled heteroscedasticity)

Problem

Heteroskedasticity means unequal (conditional) variance. In the context of regression, we talk about heteroskedasticity of error terms (i.e., residuals). If the spread of residuals systematically changes (i.e., changes with a pattern), it is a violation of one of the OLS assumptions.

Heteroskedasticity can be caused naturally. For example, “… the variability of expenditures may increase with income. Richer families may spend a similar amount on groceries as poorer people, but some rich families will sometimes buy expensive items such as lobster. The variability of expenditures for rich families is thus quite large. However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much. Heteroskedasticity can also appear when data is clustered; for example, variability of expenditures on food may vary from city to city, but is quite constant within a city” (B. Rodrigues 2018, Accessed Oct 6, 2019. https://www.brodrigues.co/blog/2018-07-08-rob_stderr/).

Heteroskedastic residuals do not bias the coefficient estimates, but they do make the standard errors unreliable.

# Creating a model with heteroskedastic residuals
hetero.reg1 <- lm(mval ~ walkscore + nroom + I(p.single*100) + in.atl, data = df2) 
summary(hetero.reg1)
## 
## Call:
## lm(formula = mval ~ walkscore + nroom + I(p.single * 100) + in.atl, 
##     data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -318829  -75655  -19480   60077  615234 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -372190.6    41835.4  -8.897  < 2e-16 ***
## walkscore            2581.1      408.9   6.313 6.09e-10 ***
## nroom              140607.5     7381.5  19.049  < 2e-16 ***
## I(p.single * 100)   -4246.2      361.5 -11.746  < 2e-16 ***
## in.atloutATL       -54283.5    13703.8  -3.961 8.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122400 on 496 degrees of freedom
## Multiple R-squared:  0.4393, Adjusted R-squared:  0.4348 
## F-statistic: 97.17 on 4 and 496 DF,  p-value: < 2.2e-16

 

Detection

Again, a residual plot can be used for visual inspection (this is one reason the importance of residual plots is emphasized over and over in the lecture). Ideally, you want to see no distinctive pattern in the plot (i.e., a completely random scatter of points). In the plot below, the residuals vary more when the predicted values are higher.

# Residual plot
plot(predict(hetero.reg1), 
     hetero.reg1$residuals, 
     xlab = "Predicted values",
     ylab = "Residuals",
     main = "Residuals vs. Predicted")
abline(h = 0, col = 'red', lty = 2)

The Breusch-Pagan test can be used to test for the presence of heteroskedasticity in the residuals of a regression. You can use bptest() in the “lmtest” package. The null hypothesis of this test is that the residuals are homoskedastic. In the test output below, we get an extremely small p-value, indicating that we can reject the null hypothesis of homoskedasticity.

# Breusch-Pagan test
bptest(hetero.reg1)
## 
##  studentized Breusch-Pagan test
## 
## data:  hetero.reg1
## BP = 85.433, df = 4, p-value < 2.2e-16

 

Solution

Transformation: One solution is to transform the data, particularly the dependent variable. Because heteroskedasticity can sometimes be caused by other misspecification issues (e.g., non-linearity), the solutions can overlap. Also note that transformation does not guarantee homoskedasticity. See below for a demonstration.

# Visualize error term after log-transforming the y-variable
hetero.reg2 <- lm(log(mval) ~ walkscore + nroom + I(p.single*100) + in.atl, data = df2) # Notice the log-transformation of DV
summary(hetero.reg2)
## 
## Call:
## lm(formula = log(mval) ~ walkscore + nroom + I(p.single * 100) + 
##     in.atl, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46353 -0.34475 -0.00622  0.33672  1.61816 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.381334   0.175133  53.567   <2e-16 ***
## walkscore          0.015154   0.001712   8.854   <2e-16 ***
## nroom              0.559174   0.030901  18.096   <2e-16 ***
## I(p.single * 100) -0.015906   0.001513 -10.511   <2e-16 ***
## in.atloutATL      -0.009509   0.057367  -0.166    0.868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5125 on 496 degrees of freedom
## Multiple R-squared:  0.4137, Adjusted R-squared:  0.4089 
## F-statistic: 87.49 on 4 and 496 DF,  p-value: < 2.2e-16
# Residual plot
plot(predict(hetero.reg2), 
     hetero.reg2$residuals, 
     xlab = "Predicted values",
     ylab = "Residuals",
     main = "Residuals vs. Predicted")
abline(h = 0, col = 'red', lty = 2)

The log-transformation alleviates the problem, but the spread of points is still not completely equal across the predicted values. To see whether the heteroskedasticity is completely gone, try bptest(hetero.reg2).
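
For example (run it yourself; the output is not shown here):

bptest(hetero.reg2) # a small p-value would indicate that some heteroskedasticity remains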

Heteroskedasticity-robust standard errors: Another option is to use heteroskedasticity-robust standard errors. The following code relies on two packages, ‘lmtest’ and ‘sandwich’. Comparing the robust results with the original results shows that the Estimates are unchanged; it is the Std. Errors (and therefore the t-values and p-values) that change. You can see that the Std. Errors are generally larger in the robust results. You don’t need to memorize the code – you know where to find it.

coeftest(hetero.reg1, vcov = vcovHC(hetero.reg1, type = "HC0"))
## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)       -372190.63   39688.49 -9.3778 < 2.2e-16 ***
## walkscore            2581.14     456.58  5.6532  2.66e-08 ***
## nroom              140607.47    8694.10 16.1728 < 2.2e-16 ***
## I(p.single * 100)   -4246.23     450.29 -9.4300 < 2.2e-16 ***
## in.atloutATL       -54283.46   18764.70 -2.8928  0.003985 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1