I’ve read that Quarto is designed to work as a modern data science lab notebook, so here’s me trying to use it that way. My goal is to create a raster surface of values interpolated from a layer of points describing rent in various locations. I’ve obtained a Zillow data set that suits my needs, providing values for 707 metropolitan statistical areas in the United States. Hopefully, that is enough to generate a meaningful interpolation. If not, I will look into using each individual rental as a point, scraping the data from Zillow in real time. That is likely the path I will eventually follow anyway, to keep the data as current as possible.
# Reads in my data from Zillow's website
rentalData <- read_csv("https://files.zillowstatic.com/research/public_csvs/zori/Metro_zori_uc_sfrcondomfr_sm_month.csv?t=1765493039")
A Note on ZORI
ZORI stands for Zillow Observed Rent Index. This is defined as:
A smoothed measure of the typical observed market rate rent across a given region. ZORI is a repeat-rent index that is weighted to the rental housing stock to ensure representativeness across the entire market, not just those homes currently listed for-rent. The index is dollar-denominated by computing the mean of listed rents that fall into the 35th to 65th percentile range for all homes and apartments in a given region, which is weighted to reflect the rental housing stock. - Zillow
While this may not capture the full breadth of affordability available in an area, it should give me enough to create a basic model. As stated, my eventual preferred option is to interpolate from individual rental listings scraped for the entire country on a daily basis. Before investing in that resource-intensive model, I wish to proof-of-concept this map with the easily available data Zillow has provided.
A quick internet search revealed that the problem I identified (adding latitude and longitude to my data) is called geocoding, something I vaguely remember discussing in one of my GIS classes. Some further reading revealed that there are several ways to do this, but I’ve decided to go with the tidygeocoder package.
# Separates the city from the state and deletes the state from RegionName
# (there is already a StateName column)
clRentalData <- rentalData |>
  mutate(RegionName = str_split_i(RegionName, ",", 1) |> str_remove_all(","))

# Sets my API key so I can batch geocode through Geocodio (2,500 a day free)
Sys.setenv(GEOCODIO_API_KEY = "f88969c36e8683dd391ec6e1d626cf31f881999")

# Does the batch geocoding and adds Lat/Long columns to my data
geoData <- clRentalData |>
  geocode(city = RegionName, state = StateName, method = "geocodio",
          lat = "Lat", long = "Long", verbose = TRUE)

# Saves this data to a csv so I don't keep having to use the geocoding service
write_csv(geoData, "geoData.csv")
I don’t want to keep using this geocoding service, since it costs money after 2,500 entries a day, so I’ve set the block above not to evaluate and substituted the block below, which loads the geocoded data from where I saved it.
# Reads in the saved data generated when the last step was performed
geoData <- read_csv("geoData.csv")
And now for an accuracy test...
# Generate a leaflet map to check the alignment of the geocoding against an Esri basemap
leaflet(geoData) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  setView(lng = -80, lat = 40, zoom = 7) |>
  addMarkers(lng = ~Long, lat = ~Lat, label = ~RegionName)
I noticed while looking at this map that there is a pin for “United States.” This is wrong. “United States” is the very first value in the data set (SizeRank == 0), and just provides a national average. I do not need this and it doesn’t fit the format of my data, so I removed it. I may end up wanting that value later for something, so it is good that it remains in other data sets above.
# Removes that observation, then prefixes column names with "X" and swaps hyphens
# for underscores so the monthly date columns become valid R names
geoData <- geoData |>
  filter(RegionName != "United States") |>
  rename_with(.fn = ~paste0("X", str_replace_all(.x, "-", "_"))) # added later
Interpolation
Now that I have the data I need to generate points, I can interpolate from those points. Thankfully, I have Robert J. Hijmans’s fabulous Spatial Data Analysis with R to guide me. So the following code is basically me working through his chapter on interpolation as a tutorial, but with my own data.
# All credit for this function goes to Hijmans, although I understand what is happening
RMSE <- function(observed, predicted) {
  sqrt(mean((predicted - observed)^2, na.rm = TRUE))
}
# This will be used later to evaluate the models.
Now, I believe I need to turn my data into a vector layer.
# Turn my data into a vector of points
vectData <- vect(geoData, c("XLong", "XLat"), crs = "EPSG:4326")

# Grab a U.S. shapefile I happen to have and vectorize that as well for cropping
vectUSA <- vect("C:/Users/Luke/OneDrive/Documents/Fireweed Cartography/Shapefiles/Census.gov/cb_2018_us_state_500k/cb_2018_us_state_500k.shp", crs = "EPSG:3857")
Proximity Polygon Interpolation
As always, standing on the shoulders of giants. I have only the vaguest idea what a Voronoi is. Aaaand after a quick Google search, now I understand. A Voronoi diagram breaks an area scattered with nodes into polygons such that every point inside a polygon is closer to that polygon’s node than to any other node.
# Create the Voronoi map
voronoiData <- voronoi(vectData)

# Crop it to the United States
voronoiDataUSA <- crop(voronoiData, vectUSA)

# Plot the map symbolized by the most recent rent data
plot(voronoiDataUSA, "X2025_11_30")
The next step is to rasterize this data.
r <- rast(ext(voronoiDataUSA), res = 0.05, crs = crs(voronoiDataUSA))
rasterData <- rasterize(voronoiDataUSA, r, field = "X2025_11_30")
plot(rasterData)
Now, I’m going to attempt Hijmans’s 5-fold cross-validation method to calculate RMSE for this interpolation. 5-fold cross-validation is a specific type of k-fold cross-validation where k = 5. It generates RMSE estimates by breaking the data into 5 groups, or folds, using 4 of the folds to train the model and the 5th to test it. This cycles through until each of the 5 folds has been used once to test a model trained on the other 4.
# Creates an index the same length as the data from which to assign the data to folds
set.seed(12)
kf <- sample(1:5, nrow(vectData), replace = TRUE)

# Generates an empty vector to store the RMSE for each fold in
rmse <- rep(NA, 5)

for (k in 1:5) {
  test <- vectData[kf == k, ]  # selects the test fold
  train <- vectData[kf != k, ] # selects the train fold
  v <- voronoi(train)          # generates a Voronoi diagram from the training set
  p <- extract(v, test)        # extracts predictions for the test set from the Voronoi diagram
  rmse[k] <- RMSE(test$X2025_11_30, p$X2025_11_30) # tests the predicted values against the observed
}
rmse
[1] 371.1283 382.2459 451.3659 346.0828 404.6336
ProxRMSE <- mean(rmse)
ProxRMSE
[1] 391.0913
That’s great; it appears to work. I read that a 10-fold cross-validation can be good for small data sets, so let’s see what I get if I modify this to do that. (Let’s also see if I understand what’s happening well enough to modify it.)
# Changed 1:5 to 1:10
set.seed(12)
kf10 <- sample(1:10, nrow(vectData), replace = TRUE)

# Changed vector to be 10 elements long
rmse10 <- rep(NA, 10)

# Changed for loop to iterate across 1:10
for (k in 1:10) {
  test10 <- vectData[kf10 == k, ]
  train10 <- vectData[kf10 != k, ]
  v10 <- voronoi(train10)
  p10 <- extract(v10, test10)
  rmse10[k] <- RMSE(test10$X2025_11_30, p10$X2025_11_30)
}
rmse10
This is all wonderful, but what, exactly, does that RMSE mean? To answer that, we need a null model to compare it against. In this case, we can use the national mean ZORI as our null model. To generate that null, I can write:
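A minimal sketch of that null calculation, assuming the prediction for every metro is simply the mean of the observed 2025-11-30 values (nullRMSE is a name I’m introducing here):

# Null model: predict the mean observed ZORI value for every metro,
# then score it with the same RMSE() function used for the folds
nullRMSE <- RMSE(geoData$X2025_11_30, mean(geoData$X2025_11_30, na.rm = TRUE))
nullRMSE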
Can we do better than this? Let’s try some other types of interpolation.
Nearest Neighbor Interpolation
d <- data.frame(geom(vectData)[, c("x", "y")], as.data.frame(vectData))

# Creates a geostatistical model to use with the interpolate() function
gsNNI <- gstat(formula = X2025_11_30 ~ 1, # intercept-only formula (no trend variables)
               locations = ~x + y,
               data = d,
               nmax = 5,             # max number of neighbors to consider
               set = list(idp = 0))  # inverse distance power = 0 means all 5 neighbors are equally weighted

# Interpolates to a raster based on the spatial model
# **Very large operation/file**
NNI <- interpolate(r, gsNNI, debug.level = 0)
NNIMask <- mask(NNI, rasterData)
plot(NNIMask, 1)
I wonder if leaflet will let me have a closer look at this map...
leaflet() |>
  addRasterImage(NNIMask) |>
  setView(lng = -100, lat = 38, zoom = 3)
Warning: using the first layer in 'x'
How about that. This is certainly a more visually appealing interpolation, but is it more accurate as well? Let’s do another 10-fold cross-validation and pray it doesn’t blow up my computer.
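A sketch of what that 10-fold cross-validation could look like for the NNI model, reusing the kf10 fold assignments and the d data frame from above; the names rmseNNI and gsFold are mine:

rmseNNI <- rep(NA, 10)
for (k in 1:10) {
  testNNI  <- d[kf10 == k, ]
  trainNNI <- d[kf10 != k, ]
  # Same settings as gsNNI, but fit only on the training fold
  gsFold <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = trainNNI,
                  nmax = 5, set = list(idp = 0))
  p <- predict(gsFold, newdata = testNNI, debug.level = 0)$var1.pred
  rmseNNI[k] <- RMSE(testNNI$X2025_11_30, p)
}
mean(rmseNNI)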
Not too shabby! That’s a sizable gain over the Proximity Polygons method, although I would still like to do better. Next up is Inverse Distance Weighting.
Inverse Distance Weighting
Unlike the last method, this method allows the weights to vary with proximity. This is perhaps the most common way to do interpolation, since it parallels Tobler’s first law of geography: near things are more related than distant things. As such, I expect this interpolation to be even more accurate than the last, but we shall see.
gsIDW <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = d)
IDW <- interpolate(r, gsIDW, debug.level = 0)
IDWMask <- mask(IDW, rasterData)
plot(IDWMask, 1)
I am feeling less than optimistic for this one after seeing the raster. This is not the map I pictured in my head, but let’s see how it does on the RMSE scores. It might be worth noting that IDW fixed the artifacts that appeared in Alaska with NNI.
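A matching sketch for IDW, scoring the same specification as gsIDW on each held-out fold; rmseIDW and gsFoldIDW are names I’m introducing:

rmseIDW <- rep(NA, 10)
for (k in 1:10) {
  testIDW  <- d[kf10 == k, ]
  trainIDW <- d[kf10 != k, ]
  # Same specification as gsIDW, refit on the training fold only
  gsFoldIDW <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = trainIDW)
  p <- predict(gsFoldIDW, newdata = testIDW, debug.level = 0)$var1.pred
  rmseIDW[k] <- RMSE(testIDW$X2025_11_30, p)
}
mean(rmseIDW)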
Not as good as NNI, but definitely better than the null and better than proximity polygons.
Geostatistical Kriging
The basis of kriging is the variogram. A variogram plots the relationship between distance and variability between pairs of points, which essentially gives us an estimate of how quickly spatial autocorrelation decays, that is, how far apart two points can be before location stops telling us anything useful.
# Creates a gstat model (same specification as the IDW model above)
gsKrieg <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = d)

# Generates a sample variogram from the model
v <- variogram(gsKrieg)
plot(v)
This variogram looks either linear or slightly Gaussian. Let’s try to fit a Gaussian model to it.
# Fits a model variogram
varModel <- fit.variogram(v, vgm("Gau"))
plot(v, varModel)
Wow! That fits pretty nicely. I’m going to use this model.
gsKriegModel <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = d,
                      model = varModel, nmax = 10)
krieg <- interpolate(r, gsKriegModel, debug.level = 0)
kriegMask <- mask(krieg, rasterData)
plot(kriegMask)
I had to make a number of parameter adjustments to get this interpolation to run. First, I changed the cell size in the receiving raster from 0.01 degrees of lon/lat to 0.05 degrees to reduce the computational and storage requirements (it was taking 10+ minutes to run). Second, I added an nmax argument to the geostatistical model, which tells it to consider only the 10 nearest points when interpolating. Without nmax specified, the computation never finished at my desired resolution. As a sanity check, I made the raster cells a full degree across and interpolated that surface with no nmax; at that level the computation actually finished, but the result was essentially worthless. nmax = 10 seems to be a happy medium.
With all that in mind, let’s test this model’s RMSE (and let’s keep half an eye on the idea of “optimization” as we look at the results from this model).
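A sketch of that test, again reusing the ten folds and, for simplicity, the varModel variogram fitted on the full data set (refitting the variogram inside each fold would be stricter); rmseKrieg and gsFoldKrieg are my names:

rmseKrieg <- rep(NA, 10)
for (k in 1:10) {
  testKrieg  <- d[kf10 == k, ]
  trainKrieg <- d[kf10 != k, ]
  # Same specification as gsKriegModel, refit on the training fold
  gsFoldKrieg <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = trainKrieg,
                       model = varModel, nmax = 10)
  p <- predict(gsFoldKrieg, newdata = testKrieg, debug.level = 0)$var1.pred
  rmseKrieg[k] <- RMSE(testKrieg$X2025_11_30, p)
}
mean(rmseKrieg)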
So, this model is in third place among the tested models so far, just behind NNI. But what if we optimized it?
Optimization
When building these interpolation models, there are two parameters that can be optimized: the distance-decay power (idp) and the number of neighbors considered (nmax). Our goal in optimization is to find the combination of those two numbers that produces the lowest RMSE. To do this, we will follow, and take inspiration from, the function developed by Hijmans.
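A sketch of that optimization in the spirit of Hijmans’s function: the objective takes a candidate (nmax, idp) pair, fits the model on a training split, and returns the RMSE on a held-out test split, which optim() then minimizes. The 20% hold-out and the starting values c(8, 0.5) are assumptions on my part:

# Objective function: given a candidate c(nmax, idp), fit the model on the
# training points and return the RMSE of its predictions at the test points
f1 <- function(x, test, train) {
  nmx <- x[1]
  idp <- x[2]
  if (nmx < 1 || idp < 0.001) return(Inf) # keep optim() away from invalid values
  m <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = train,
             nmax = nmx, set = list(idp = idp))
  p <- predict(m, newdata = test, debug.level = 0)$var1.pred
  RMSE(test$X2025_11_30, p)
}

# Hold out 20% of the points as a test set
set.seed(12)
i <- sample(nrow(d), 0.2 * nrow(d))
tst <- d[i, ]
trn <- d[-i, ]

# Nelder-Mead search over (nmax, idp), starting from nmax = 8, idp = 0.5
opt <- optim(c(8, 0.5), f1, test = tst, train = trn)
str(opt)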
List of 5
$ par : num [1:2] 6.26 2.73
$ value : num 351
$ counts : Named int [1:2] 37 NA
..- attr(*, "names")= chr [1:2] "function" "gradient"
$ convergence: int 0
$ message : NULL
The optimized values can be found under ‘par’ here, with 6.26 being the optimal nmax and 2.73 being the optimal idp. Mapped, that looks like:
gsOpt <- gstat(formula = X2025_11_30 ~ 1, locations = ~x + y, data = d,
               nmax = opt$par[1], set = list(idp = opt$par[2]))
optMap <- interpolate(r, gsOpt, debug.level = 0)
optMapMask <- mask(optMap, rasterData)
plot(optMapMask)
Wowow-wewow! That’s the most accurate yet, and by a significant margin, too. Moreover, this map is beginning to match what my instincts tell me about how property values are laid out. Time for a closer look.