EDA Neighborhood - Air Quality - Water Quality

Research Question and Policy Problem

What is the market valuation for outdoor clean air in Medellín, as revealed by housing prices in a hedonic property model?

The policy problem follows: how much are households willing to pay for better outdoor air quality, as revealed by housing prices?

Exploratory Analysis of Neighborhood Data

Let's start with a simple histogram to check for outliers and whether the distribution approaches normality. The data are current: I have a meeting this week with the Medellín Catastro to discuss whether they can provide several years of valuations, but in the meantime everything here is based on 2018, a year for which I know data exist from all sources.

ggplot(data = hood, aes(x = log(SumaDeAvaluoTotal))) +
  geom_histogram(col = "dark green", fill = "dark green", alpha = .5) +
  labs(x = "Valuation Sum (log)", y = "Count",
       title = "Sum of Valuation per Neighborhood in Medellin, 2018",
       caption = "Medellín Paper EDA, Rojas 2021")

Let’s group them by comuna and observe their variation in a box plot.

options(scipen = 100)
ggplot(data = hood, aes(x = nomComuna, y = SumaDeAvaluoTotal / 1000000, fill = nomComuna)) +
  geom_boxplot() + xlab("District") + ylab("Total District Valuation (millions)") +
  coord_flip() + theme(legend.position = "none")

There is enough variation, but this is in absolute terms; it can be adjusted by population size and put on a log scale, more on that in a bit. For now, let's check the level of rent per neighborhood. This variable is a composite from the Household Survey: how much households pay in rent or, if they are not renting, how much they think they would pay. It mixes an objective and a subjective valuation, which raises methodological questions worth discussing. I also do some cleaning to make the barrio names match across sources, sketched below.
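A minimal sketch of that name-matching step, assuming the cadastre names live in nombre_bar and the survey names in a hypothetical barrio column of an ecv data frame (the real column names may differ):

library(dplyr)
library(stringr)

# Normalize names before joining: unify case, collapse whitespace, strip accents
clean_names <- function(x) {
  x %>%
    str_to_upper() %>%
    str_squish() %>%
    iconv(to = "ASCII//TRANSLIT")
}

merged_hood_ecv <- hood %>%
  mutate(nombre_bar = clean_names(nombre_bar)) %>%
  left_join(ecv %>% mutate(barrio = clean_names(barrio)),
            by = c("nombre_bar" = "barrio"))

With the names matched, here is the distribution by comuna: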

ggplot(data = merged_hood_ecv, aes(x = nomComuna, y = ave_valuation, fill = nomComuna)) +
  geom_boxplot() + xlab("District") + ylab("Average District Valuation") +
  coord_flip() + theme(legend.position = "none")

(NOTE: the PM2.5 values used below are 2019 averages; this still needs to be fixed to use 2018.)

We can also compare the valuations and the levels of rent across neighborhoods. We would expect an increasing relationship between the two variables, which is what we see in this plot:

merged_hood_ecv %>% filter(log(rental) > 12) %>%
  ggplot(aes(x = log(rental), y = log(SumaDeAvaluoTotal))) +
  geom_point(aes(col = nomComuna), alpha = .8) +
  labs(x = "Rental Price (log)", y = "Sum of Valuation (log)",
       title = "Household Valuation and Rental Prices per Neighborhood, Medellin, 2018",
       subtitle = "Grouped by Comuna",
       caption = "Medellín Paper EDA, Rojas 2021") +
  theme(legend.position = "none")

Here I do the same, but rather than the net sum of valuation, I divide the valuation by the number of households to get an average valuation (sketched below). It shifts many of the observations, but not the overall direction of the relationship.
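A one-line sketch of that construction, assuming the count column holds the number of households per neighborhood (the actual divisor column may differ):

library(dplyr)

# Average cadastral valuation per household in each neighborhood
merged_hood_ecv <- merged_hood_ecv %>%
  mutate(ave_valuation = SumaDeAvaluoTotal / count)

The same scatter as before, now against the average: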

merged_hood_ecv %>% filter(log(rental) > 12) %>%
  ggplot(aes(x = log(rental), y = log(ave_valuation))) +
  geom_point(aes(col = nomComuna), alpha = .8) +
  labs(x = "Rental Price (log)", y = "Average Valuation (log)",
       title = "Household Valuation and Rental Prices per Neighborhood, Medellin, 2018",
       subtitle = "Grouped by Comuna",
       caption = "Medellín Paper EDA, Rojas 2021") +
  theme(legend.position = "none")

Air Quality and Neighborhood Valuations

Now it's time to look at neighborhood prices and their relationship with the Air Quality stations. Here, I spatially match each station to its closest neighborhood, using the 2018 average PM2.5 value for each AQ station; a sketch of the matching step follows. In total, only 12 stations end up matching, but there is still some variation among them.
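A hedged sketch of that spatial match, assuming stations (station points, with a hypothetical pm25 column of station means) and hood_shapes (neighborhood polygons, with codbarrio) are available as sf objects:

library(sf)
library(dplyr)

# For each station point, find the index of the closest neighborhood polygon
nearest <- st_nearest_feature(stations, hood_shapes)
stations$codbarrio <- hood_shapes$codbarrio[nearest]

# Mean PM2.5 per matched neighborhood, ready to join onto merged_hood_ecv
pm_by_hood <- stations %>%
  st_drop_geometry() %>%
  group_by(codbarrio) %>%
  summarise(meanpm25 = mean(pm25, na.rm = TRUE))

The resulting relationship between valuations and PM2.5: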

merged_hood_ecv %>% filter(log(rental) > 12) %>%
  ggplot(aes(y = meanpm25, x = log(ave_valuation))) +
  geom_point(aes(col = nomComuna), alpha = .8) +
  labs(y = "Mean PM2.5", x = "Average Valuation (log)",
       title = "Household Valuation and Mean PM2.5 per Neighborhood",
       subtitle = "Grouped by Comuna",
       caption = "Medellín Paper EDA, Rojas 2021") +
  theme(legend.position = "none")

Precipitation Measurements

Carriazo (2018) suggests using precipitation and wind as instruments in housing markets. There are 796 stations measuring precipitation and wind, and most of them match to a neighborhood, so those data will have far more variation and will be much easier to pair with neighborhoods. Whereas I can match only 12 AQ stations to the 160 neighborhoods in Medellín, here I might find one station for every one or two neighborhoods, a much finer measurement.

However, those data have to be downloaded manually, one month at a time: 12 files per station per year, or 796 stations × 12 months = 9,552 files to download by hand for a single year (and if I obtain better yearly valuations, that number multiplies with each additional year). So, for illustration purposes, I downloaded a sample of 10 water stations; a hypothetical sketch of how such monthly files could be combined follows. There is enough variation that these data could be used given an analytical plan, but downloading everything just for EDA is impractical.
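As a sketch, under assumed file layout and column names (station_id and value are hypothetical), of how the monthly files for the sample could be reduced to the yearly means read in below:

library(dplyr)
library(purrr)

# Read every monthly CSV in the sample folder and compute yearly station means
files <- list.files("rain_sample_2018", pattern = "\\.csv$", full.names = TRUE)
rain_means <- files %>%
  map_dfr(read.csv) %>%
  group_by(station_id) %>%
  summarise(wat_q = mean(value, na.rm = TRUE))
# presumably then saved as rmean_2018.csv for reuse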

We can observe some variation in the precipitation measurements:

rain_means <- read.csv("/Users/alfonso/Google Drive/Data Colombia/Medellin/rmean_2018.csv")
ggplot(data = rain_means, aes(x = wat_q)) +
  geom_density(col = "dark blue", fill = "dark blue", alpha = .5) +
  labs(y = "Density", x = "Water Quality Value",
       title = "Water Quality for 10-Station Sample in Medellin, 2018",
       caption = "Medellín Paper EDA, Rojas 2021")

Household Survey Variables

We also have the Household Survey variables, which can be used in the model. I have those Household Surveys for every year from 2008 to 2020, so the binding restriction is the Air Quality data. A correlation plot of all the variables provides some interesting (but expected) results:

# Drop identifiers and non-numeric columns before computing correlations
for_corr <- merged_hood_ecv %>%
  select(-c(codbarrio, estacion, nombre_bar, join_estacion, count,
            tot_pop, year, nomComuna, codComuna, DS_BARRIO, pop))
corrmat <- cor(for_corr, use = "complete.obs")  # keep only rows complete across all variables
corrplot(corrmat)

Exploratory Predictions

Now let's run a couple of exploratory regressions: OLS for rent and for average valuation, plus the same specifications with comuna fixed effects.

model1 <- lm(log(rental) ~ meanpm25 + female_prop + ave_pop + log(ave_income) +
               ave_hhsize + ave_age + ave_tenurelength + ave_edlevel +
               ave_commtime + ave_rooms, data = merged_hood_ecv)

model2 <- lm(log(ave_valuation) ~ meanpm25 + female_prop + ave_pop + log(ave_income) +
               ave_hhsize + ave_age + ave_tenurelength + ave_edlevel +
               ave_commtime + ave_rooms, data = merged_hood_ecv)

# Comuna fixed effects via the "within" estimator, indexed by codComuna
model3 <- plm(log(rental) ~ meanpm25 + female_prop + ave_pop + log(ave_income) +
                ave_hhsize + ave_age + ave_tenurelength + ave_edlevel +
                ave_commtime + ave_rooms,
              data = merged_hood_ecv, model = "within", index = c("codComuna"))

model4 <- plm(log(ave_valuation) ~ meanpm25 + female_prop + ave_pop + log(ave_income) +
                ave_hhsize + ave_age + ave_tenurelength + ave_edlevel +
                ave_commtime + ave_rooms,
              data = merged_hood_ecv, model = "within", index = c("codComuna"))

stargazer(model1, model2, model3, model4, type = 'html')
| | (1) log(rental), OLS | (2) log(ave_valuation), OLS | (3) log(rental), panel | (4) log(ave_valuation), panel |
|---|---|---|---|---|
| meanpm25 | 0.010 (0.007) | 0.044** (0.022) | 0.012 (0.011) | 0.016 (0.035) |
| female_prop | 0.041 (0.373) | -0.526 (1.234) | 0.036 (0.334) | -0.535 (1.108) |
| ave_pop | 0.944 (5.951) | -36.175* (19.707) | 2.287 (5.965) | -21.433 (19.819) |
| log(ave_income) | 0.303*** (0.063) | 0.392* (0.208) | 0.180*** (0.066) | 0.138 (0.219) |
| ave_hhsize | 0.135*** (0.036) | 0.291** (0.120) | 0.087** (0.036) | 0.284** (0.118) |
| ave_age | 0.012** (0.005) | 0.063*** (0.017) | 0.006 (0.005) | 0.055*** (0.017) |
| ave_tenurelength | -0.003 (0.003) | -0.025** (0.010) | -0.002 (0.003) | -0.005 (0.010) |
| ave_edlevel | 0.259*** (0.040) | 0.512*** (0.134) | 0.270*** (0.040) | 0.324** (0.131) |
| ave_commtime | -0.007*** (0.003) | -0.014* (0.008) | -0.006** (0.003) | -0.010 (0.009) |
| ave_rooms | 0.119*** (0.024) | 0.068 (0.079) | 0.108*** (0.023) | 0.131* (0.077) |
| Constant | 6.703*** (0.881) | 12.003*** (2.918) | | |
| Observations | 161 | 161 | 161 | 161 |
| R2 | 0.883 | 0.715 | 0.618 | 0.323 |
| Adjusted R2 | 0.875 | 0.696 | 0.533 | 0.173 |
| Residual Std. Error (df = 150) | 0.208 | 0.689 | | |
| F Statistic | 113.377*** (df = 10; 150) | 37.679*** (df = 10; 150) | 21.154*** (df = 10; 131) | 6.252*** (df = 10; 131) |

Standard errors in parentheses. Note: *p<0.1; **p<0.05; ***p<0.01

Analytical Plan

  1. I need to figure out whether the Water Quality and Wind data are worth using.
  2. If the WQ and Wind data are usable, organize the two-stage model for estimation. My initial thought is to regress PM2.5 on WQ and Wind in a first stage, then run something like the regressions I already have using the fitted PM2.5 values (see the sketch after this list).
  3. Given the data availability, consider whether there is some approach that uses Water and Wind (or other meteorological data) more aggressively. This would imply shifting the focus of the paper, so I'd like Andrew's feedback on whether that is a good idea.
  4. Robustness checks (???)
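A hedged sketch of that two-stage idea, using AER::ivreg with hypothetical instrument columns precip and wind that are not yet in the data:

library(AER)

# Structural equation before |, instruments after |: the excluded
# instruments (precip, wind) plus all exogenous controls
iv_model <- ivreg(
  log(rental) ~ meanpm25 + female_prop + ave_pop + log(ave_income) +
    ave_hhsize + ave_age + ave_tenurelength + ave_edlevel +
    ave_commtime + ave_rooms |
    precip + wind + female_prop + ave_pop + log(ave_income) +
    ave_hhsize + ave_age + ave_tenurelength + ave_edlevel +
    ave_commtime + ave_rooms,
  data = merged_hood_ecv
)
summary(iv_model, diagnostics = TRUE)  # weak-instrument and Wu-Hausman diagnostics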

Next Steps for the Month

  1. Once we agree on which approach to use, get the data - 1-2 weeks
  2. Run the agreed models - 1-2 weeks
  3. Start robustness checks (which ones???) - final week (mid-October)