I was reading about Adjustment and finding it difficult to get my head around the examples in the recommended STDS course reading. Looking around for other examples, I found Jun Chen’s blog “Confounders and Adjustment.” The example that Jun gives using the Amazon books dataset made the concept of adjustment easier for me to understand. Based on Jun’s example, I have been able to identify confounding variables in the dataset for my group assignment.
Our dataset is geared around discovering the impact that earthquakes have on house prices. We have included additional variables in our dataset that we believe could explain house prices, such as crime rate, income, unemployment rate and population.
We’re interested in the relationship between the binary group variable Destructive Earthquakes (1 = an earthquake with a magnitude high enough to cause property damage occurred in that city in that year, 0 = no destructive earthquakes occurred in that year) and the outcome: House Price. As per Jun’s approach to identifying confounders:
“We’re concerned that the relationship might be distorted or confounded by a third variable, we’re going to use plots and compare the change of coefficient with and without the inclusion of the third variable in the model to investigate whether different attributes have the confounding effect on the relationship.”
In my team’s case, we’re concerned that the relationship between Destructive Earthquakes and House Price might be distorted by a third variable. In this example we’re going to use plots and compare the change in the coefficient with and without the inclusion of Crime Rate.
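library(dplyr)    # provides the %>% pipe
library(ggplot2)  # plotting

yr_quakes %>%
  ggplot(mapping = aes(x = crime_rate, y = house_price, colour = des_quake)) +
  geom_point(alpha = 0.3) +
  # Dashed lines: mean house price with (438.7198) and without (452.9788) a destructive earthquake
  geom_hline(yintercept = c(438.7198, 452.9788), colour = c("cyan3", "coral"), linetype = "dashed") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(colour = "Quakes Y/N")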
## `geom_smooth()` using formula 'y ~ x'
The dashed horizontal lines are the average house prices when there has been a destructive earthquake and when there hasn’t, disregarding the crime rate. According to Jun’s blog, “the distance between is known as the marginal effect of group status.” Interestingly, the average house price is slightly higher when there has not been a destructive earthquake. The fitted lines of the estimated model incorporating crime rate are not parallel, which means the gap in house price between the two groups depends on the crime rate. This indicates that crime has an impact on the relationship between house price and destructive earthquakes.
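To make the "not parallel" point concrete, here is a minimal sketch on simulated data. The `toy` data frame and every number in it are invented for illustration and are not taken from our dataset. It shows that fitting one regression line per group, which is what the coloured fitted lines do, gives the same slopes as a single model with an interaction term between `des_quake` and `crime_rate`.

```r
# Simulated stand-in for yr_quakes: every number below is made up for illustration.
set.seed(42)
n <- 200
toy <- data.frame(
  des_quake  = factor(rep(c(0, 1), each = n / 2)),
  crime_rate = runif(n, min = 0, max = 0.3)
)
# Give the two groups different (invented) slopes so the fitted lines are not parallel.
slope <- ifelse(toy$des_quake == "1", -700, -400)
toy$house_price <- 500 + slope * toy$crime_rate + rnorm(n, sd = 20)

# One regression per group, like the per-colour fitted lines in the plot.
fit_no_quake <- lm(house_price ~ crime_rate, data = subset(toy, des_quake == "0"))
fit_quake    <- lm(house_price ~ crime_rate, data = subset(toy, des_quake == "1"))
slope_no_quake <- coef(fit_no_quake)[["crime_rate"]]
slope_quake    <- coef(fit_quake)[["crime_rate"]]

# A single model with an interaction term reproduces both slopes.
fit_int <- lm(house_price ~ des_quake * crime_rate, data = toy)
b <- coef(fit_int)
int_slope_no_quake <- b[["crime_rate"]]
int_slope_quake    <- b[["crime_rate"]] + b[["des_quake1:crime_rate"]]

print(c(no_quake = slope_no_quake, quake = slope_quake))
print(c(no_quake = int_slope_no_quake, quake = int_slope_quake))
```

If the interaction coefficient were close to zero, the two lines would be roughly parallel and the gap between the groups would be about the same at every crime rate.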
Looking at the models and their coefficients:
fit1 <- lm(house_price ~ des_quake, data = yr_quakes)
summary(fit1)$coef
##              Estimate Std. Error   t value   Pr(>|t|)
## (Intercept) 452.97881   5.780369 78.365031 0.00000000
## des_quake1  -14.25905   7.714579 -1.848325 0.06461838
fit2 <- lm(house_price ~ des_quake + crime_rate, data = yr_quakes)
summary(fit2)$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  504.32643   9.638198  52.32580 0.000000e+00
## des_quake1    62.46001  11.590279   5.38900 7.585091e-08
## crime_rate  -551.92037  30.703807 -17.97563 5.531097e-69
Including crime rate changes the intercept estimate by -11.33% and the des_quake1 coefficient by 538.03%, flipping its sign from negative to positive. With a change this large, we can say that crime rate has a confounding effect on the relationship between house price and destructive earthquakes.
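The percentages can be reproduced, to within rounding, from the printed coefficients. This is a small sketch that assumes the change is measured relative to the model without crime rate, i.e. (fit1 - fit2) / fit1. The vector names `fit1_coef`, `fit2_coef` and `pct_change` are just illustrative.

```r
# Coefficients copied from the summary() output above.
fit1_coef <- c(intercept = 452.97881, des_quake1 = -14.25905)
fit2_coef <- c(intercept = 504.32643, des_quake1 = 62.46001)

# Assumed formula: change relative to the unadjusted (fit1) value, in percent.
pct_change <- (fit1_coef - fit2_coef) / fit1_coef * 100
print(round(pct_change, 1))
```

This gives roughly -11.3% for the intercept and 538% for des_quake1. The second figure is so large because the coefficient not only grows in size but also changes sign.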
In years with a destructive earthquake there is a negative linear relationship between crime rate and house prices, and I would also expect house prices to decline after an earthquake. When I discussed this with one of our STDS subject tutors, Leonie Payne, she recommended looking more closely at the coefficients of the models: the coefficient for des_quake1 is -14.25905 when crime rate is not included, and changes to 62.46001 when crime rate is included. This suggests that, once the model controls for crime, house prices are higher when there has been an earthquake, which is the opposite of what I expected. Leonie also suggested that there might be at least one more confounding variable, not yet included in this model, that is causing these inconsistencies.
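To see what "controls for crime" means in terms of the fitted numbers, here is a rough sketch that uses only the fit2 coefficients printed above. The helper `predict_price` and the crime rates in `crime_levels` are hypothetical, picked only for illustration. Holding crime rate fixed, the predicted gap between quake and no-quake years is always the des_quake1 coefficient.

```r
# fit2 coefficients copied from the summary() output above.
b0      <- 504.32643   # intercept
b_quake <- 62.46001    # des_quake1
b_crime <- -551.92037  # crime_rate

# Predicted house price from fit2 for a quake status (0 or 1) and a crime rate.
predict_price <- function(des_quake, crime_rate) {
  b0 + b_quake * des_quake + b_crime * crime_rate
}

# Hypothetical crime rates chosen only for illustration.
crime_levels <- c(0.05, 0.10, 0.20)
gap <- predict_price(1, crime_levels) - predict_price(0, crime_levels)
print(gap)  # the same gap at every crime rate

# For least-squares fits, unadjusted gap = adjusted gap + b_crime * (gap in mean crime rate),
# so the printed coefficients imply how much higher the average crime rate is in quake years.
unadjusted_gap    <- -14.25905  # des_quake1 from fit1
implied_crime_gap <- (unadjusted_gap - b_quake) / b_crime
print(round(implied_crime_gap, 3))
```

On this reading, the sign flip comes from quake years having a higher average crime rate in the data (about 0.139 higher, in whatever units crime_rate is recorded in), which pulls the unadjusted average price down.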
We can conclude that when we are modelling the impact of destructive earthquakes on house prices, we will need to control for crime rate and potentially some other confounding variables that I am yet to uncover. Big thanks to Jun Chen for making the concept of adjustment and confounding variables easier to understand. And big thanks to Leonie Payne for her feedback and for helping to point me in the right direction. After working through this example, I definitely have a better idea of this concept.
How to control for crime rate in our model is the next question. Fingers crossed Jun Chen has another easy-to-follow blog post on that topic too!