What is the market valuation for outdoor clean air in Medellín, as revealed by housing prices in a hedonic property model?
The policy problem is: How much are households willing to pay for better outdoor air quality, as revealed by housing prices?
Let’s start with a simple histogram to see if there are outliers or the distribution approaches normality. This data is current. I have a meeting this week with the Medellin Catastro, to discuss whether they will be able to provide several years of valuations. But in the meantime, I am doing everything based on 2018, which is a year I know there is data for all sources.
ggplot(data=hood, aes(x=log(SumaDeAvaluoTotal))) + geom_histogram(col="dark green",
fill="dark green",
alpha = .5) +
labs(x = "Valuation Sum", y = "Proportion",
title = "Sum of Valuation per Neighborhood in Medellin, 2018", fill = "Comuna",
caption = "Medellín Paper EDA, Rojas 2021")
Let’s group them by comuna and observe their variation in a box plot.
options(scipen = 100)
ggplot(data=hood, aes(x=nomComuna, y=SumaDeAvaluoTotal/1000000, fill=nomComuna)) +
geom_boxplot() + xlab("District") + ylab("Total District Valuation") +
coord_flip() + theme(legend.position="none")
There is enough variation, but this is in absolute terms. This can be adjusted by population size and in log mode. More on that in a bit. For now, now let’s check the level of rent per neighborhood. This variable is a composite from the Household Survey, which is how much they are paying of rent, and if they aren’t, how much they think they should pay. It is a mix of an objective and subjective valuation, which has a lot of methodological aspects which could be discussed. I also do some cleaning to make the barrio names match.
ggplot(data=merged_hood_ecv, aes(x=nomComuna, y=ave_valuation, fill=nomComuna)) +
geom_boxplot() + xlab("District") + ylab("Average District Valuation") +
coord_flip() + theme(legend.position="none")
###USING 2019 PM25 NEED TO FIX THIS
We can also compare the valuations and the levels of rent in the different Neighborhoods. We would expect an increasing relationship between both variables, which is what we see in this plot:
merged_hood_ecv %>% filter(log(rental) > 12) %>%
ggplot(aes(x=log(rental), y=log(SumaDeAvaluoTotal))) +
geom_point(aes(col = nomComuna,
alpha = .8)) +
labs(x = "Rental Price (log)", y = "Sum of Valuation (log)",
title = "Household Valuation and Rental Prices per Neighborhood, Medellin, 2018", fill = "Comuna", subtitle= "Grouped by Comuna",
caption = "Medellín Paper EDA, Rojas 2021") + theme(legend.position="none")
Here I do the same but rather than the net sum of valuation, I divide the valuation over the amount of households, to get an average valuation. It changes many of the observations, but not the overall direction of the relationship.
merged_hood_ecv %>% filter(log(rental) > 12) %>%
ggplot(aes(x=log(rental), y=log(ave_valuation))) +
geom_point(aes(col = nomComuna,
alpha = .8)) +
labs(x = "Rental Price (log)", y = "Average of Valuation (log)",
title = "Household Valuation and Rental Prices per Neighborhood, Medellin, 2018", fill = "Comuna", subtitle= "Grouped by Comuna",
caption = "Medellín Paper EDA, Rojas 2021") + theme(legend.position="none")
Now it’s time to start observing the neighborhood prices, and its relationship with Air Quality Stations. Here, I spatially match each station to its closest neighborhood. I am using 2018 average for each PM2.5 value for the AQ stations. In total, only 12 stations end up matching, but there is still some variation between those.
merged_hood_ecv %>% filter(log(rental) > 12) %>%
ggplot(aes(y=meanpm25, x=log(ave_valuation))) +
geom_point(aes(col = nomComuna,
alpha = .8)) +
labs(y = "Mean PM25", x = "Average of Valuation (log)",
title = "Household Valuation and Mean PM2.5 per Neighborhood", fill = "Comuna", subtitle= "Grouped by Comuna",
caption = "Medellín Paper EDA, Rojas 2021") + theme(legend.position="none")
Carriazo 2018, suggests using Precipitation and Wind as instruments in housing markets. So, there are 796 stations for precipitation, and for wind. Most of them, also have match per neighborhood, so there will be a lot of variation in those data, and they will be much easier to match with neighborhoods. Where I am only able to match 12 AQ stations to 160 neighborhoods in Medellín, I might be able to find one station per 1 or 2 neighborhoods, which is a much much better measurement.
However, downloading those data has to be done manually, per month. That means that there are 12 files per air quality station, per year. That’s 9552 files I have to download manually, per year (if I get better yearly valuations, that gets multiplied per each additional year I’m able to get). So, for illustration purposes, I downloaded a sample of 10 water stations, and there is enough variation where this data could be used, if there’s an analytical plan, but it’s impractical to download it all for EDA purposes.
We can observe some variation in the precipitation measurements:
rain_means <- read.csv("/Users/alfonso/Google Drive/Data Colombia/Medellin/rmean_2018.csv")
ggplot(data=rain_means, aes(wat_q)) + geom_density(col="dark blue",
fill="dark blue",
alpha = .5) +
labs(y = "Density", x = "Water Quality Value",
title = "Water Quality for 10-station Sample in Medellin, 2018",
caption = "Medellín Paper EDA, Rojas 2021")
We also have the Household Survey variables, which can be used in the model. I have those Household Surveys for every year from 2008 until 2020, so the restriction there is the Air Quality Data. A correlation plot for all of the variables provides some interesting (but expected) results:
for_corr <- merged_hood_ecv %>%
select(-c(codbarrio, estacion, nombre_bar, join_estacion, count,
tot_pop, year, nomComuna, codComuna, DS_BARRIO, pop))
corrmat <- cor(for_corr, use="complete.obs")
corrplot(corrmat)
## Exploratory Predictions
Now let’s run a couple of Exploratory Regressions.
model1 <- lm(formula = log(rental) ~ meanpm25 + female_prop + ave_pop + log(ave_income) + ave_hhsize + ave_age + ave_tenurelength +
ave_edlevel + ave_commtime + ave_rooms, data = merged_hood_ecv)
model2 <- lm(formula = log(ave_valuation) ~ meanpm25 + female_prop + ave_pop + log(ave_income) +
ave_hhsize + ave_age + ave_tenurelength +
ave_edlevel + ave_commtime + ave_rooms, data = merged_hood_ecv)
model3 <- plm(formula = log(rental) ~ meanpm25 + female_prop + ave_pop + log(ave_income) + ave_hhsize + ave_age + ave_tenurelength +
ave_edlevel + ave_commtime + ave_rooms, data = merged_hood_ecv, model="within", index=c("codComuna"))
model4 <- plm(formula = log(ave_valuation) ~ meanpm25 + female_prop + ave_pop + ave_hhsize + ave_age + ave_tenurelength +
ave_edlevel + ave_commtime + ave_rooms + log(ave_income), data = merged_hood_ecv, model="within", index=c("codComuna"))
stargazer(model1, model2, model3, model4, type = 'html')
| Dependent variable: | ||||
| log(rental) | log(ave_valuation) | log(rental) | log(ave_valuation) | |
| OLS | OLS | panel | panel | |
| linear | linear | |||
| (1) | (2) | (3) | (4) | |
| meanpm25 | 0.010 | 0.044** | 0.012 | 0.016 |
| (0.007) | (0.022) | (0.011) | (0.035) | |
| female_prop | 0.041 | -0.526 | 0.036 | -0.535 |
| (0.373) | (1.234) | (0.334) | (1.108) | |
| ave_pop | 0.944 | -36.175* | 2.287 | -21.433 |
| (5.951) | (19.707) | (5.965) | (19.819) | |
| log(ave_income) | 0.303*** | 0.392* | 0.180*** | 0.138 |
| (0.063) | (0.208) | (0.066) | (0.219) | |
| ave_hhsize | 0.135*** | 0.291** | 0.087** | 0.284** |
| (0.036) | (0.120) | (0.036) | (0.118) | |
| ave_age | 0.012** | 0.063*** | 0.006 | 0.055*** |
| (0.005) | (0.017) | (0.005) | (0.017) | |
| ave_tenurelength | -0.003 | -0.025** | -0.002 | -0.005 |
| (0.003) | (0.010) | (0.003) | (0.010) | |
| ave_edlevel | 0.259*** | 0.512*** | 0.270*** | 0.324** |
| (0.040) | (0.134) | (0.040) | (0.131) | |
| ave_commtime | -0.007*** | -0.014* | -0.006** | -0.010 |
| (0.003) | (0.008) | (0.003) | (0.009) | |
| ave_rooms | 0.119*** | 0.068 | 0.108*** | 0.131* |
| (0.024) | (0.079) | (0.023) | (0.077) | |
| Constant | 6.703*** | 12.003*** | ||
| (0.881) | (2.918) | |||
| Observations | 161 | 161 | 161 | 161 |
| R2 | 0.883 | 0.715 | 0.618 | 0.323 |
| Adjusted R2 | 0.875 | 0.696 | 0.533 | 0.173 |
| Residual Std. Error (df = 150) | 0.208 | 0.689 | ||
| F Statistic | 113.377*** (df = 10; 150) | 37.679*** (df = 10; 150) | 21.154*** (df = 10; 131) | 6.252*** (df = 10; 131) |
| Note: | p<0.1; p<0.05; p<0.01 | |||