Rent data wrangling

Reloading the Fair Market Rent data

In an earlier lesson, we looked at how to extract ZIP-code-level Small-Area Fair Market Rent data from the U.S. Department of Housing and Urban Development, filter it for selected Rutherford County ZIP codes, and display the data in a table. Below is the code from that lesson, followed by the table the code produces.

Rerun the code. Doing so will ensure that the rent data will be available to to R on the computer you are using. Then we’ll look at some additional things we can do with the data using some of R’s data wrangling tools.

# Installing and loading required packages

if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("openxlsx"))
  install.packages("openxlsx")
if (!require("gtExtras"))
  install.packages("gtExtras")

library(tidyverse)
library(openxlsx)
library(gtExtras)

# Reading data from:
# https://www.huduser.gov/portal/datasets/fmr/smallarea/index.html#year2024

FMR <- read.xlsx(
  "https://www.huduser.gov/portal/datasets/fmr/fmr2024/fy2024_safmrs_revised.xlsx",
  sheet = 1)

# Making a list of Rutherford County ZIP codes

ZIPList <- c(
  "37127",
  "37128",
  "37129",
  "37130",
  "37132",
  "37085",
  "37118",
  "37149",
  "37037",
  "37153",
  "37167",
  "37086")

# Filtering for Rutherford ZIP codes and
# selecting columns of interest

FMR_RuCo <- FMR %>%
  filter(ZIP.Code %in% ZIPList) %>%
  select(ZIP.Code, SAFMR.0BR, SAFMR.1BR, SAFMR.2BR, SAFMR.3BR, SAFMR.4BR) %>% 
  distinct()

# Renaming the columns

colnames(FMR_RuCo) <- c("ZIP", "Studio", "BR1", "BR2", "BR3", "BR4")

# Showing the data as a table

FMR_RuCo_table <- gt(FMR_RuCo) %>%
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940
37085	1260	1290	1450	1820	2210
37086	1580	1620	1820	2290	2790
37118	1100	1130	1270	1590	1960
37127	1240	1270	1430	1800	2190
37128	1510	1550	1740	2190	2670
37129	1420	1460	1640	2060	2510
37130	1180	1210	1360	1710	2080
37132	1180	1210	1360	1710	2080
37149	1100	1130	1270	1590	1960
37153	1410	1450	1630	2040	2490
37167	1290	1330	1490	1870	2280

Mutating

Chances are, you already know how to produce an average. You just add the numbers you want to average, then divide by however many numbers there are, right? Easy. The mutate() function can average each ZIP code’s rents, then put the average in a new column we’ll call “ZIP_Average.”

The mutate() function is part of the dplyr package, which R loaded into memory when it ran the library(tidyverse) line in the code above. The dplyr package is part of the tidyverse package. Note that mutate() won’t work unless library(tidyverse) has been run sometime since you opened RStudio, and library(tidyverse) won’t work unless install.packages("tidyverse") has been run at at least one time on the computer you are using.

That’s a general principle to keep in mind. If a function is part of base R, then all you have to do to use the function is to open R Studio. But if, like mutate(), the function is part of a package, then you have to install the package once per computer, using install.packages() and load the package once per computing session, using library(), before you can use the function. It sounds like a pain, but it’s really not. In practice, you simply begin a script by installing, if needed, and loading all the packages required by the script. That’s what’s going on above, in the section of code marked # Installing and loading required packages. And there are just a handful of packages you’ll use routinely.

This code will use mutate() to calculate the average. Then, the code will regenerate the table with the new variable included. The table is shown below.

FMR_RuCo <- FMR_RuCo %>%
  mutate(ZIP_Average = (Studio + BR1 + BR2 + BR3 + BR4) / 5)

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4	ZIP_Average
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940	2128
37085	1260	1290	1450	1820	2210	1606
37086	1580	1620	1820	2290	2790	2020
37118	1100	1130	1270	1590	1960	1410
37127	1240	1270	1430	1800	2190	1586
37128	1510	1550	1740	2190	2670	1932
37129	1420	1460	1640	2060	2510	1818
37130	1180	1210	1360	1710	2080	1508
37132	1180	1210	1360	1710	2080	1508
37149	1100	1130	1270	1590	1960	1410
37153	1410	1450	1630	2040	2490	1804
37167	1290	1330	1490	1870	2280	1652

Sorting, low to high

It’s not difficult to scan through the dozen average rents in the table and discern that two ZIP codes, 37118 and 37149, tie for the cheapest average, $1,410, while the 37037 ZIP code has the most expensive average, $2,128. Identifying these extremes would be a lot more difficult, though, if the table contained (as it will in future lessons in this series) all 80 ZIP codes across Rutherford, Davidson and Williamson counties.

So it would be handy to know how to arrange the ZIP codes by their ZIP_Average values, so that the ZIP_Average values get bigger as you go from the top of the table to the bottom of the table. This code will do exactly that, using the arrange() function from the dplyr package, then regenerate the table:

# Sorting, ascending order

FMR_RuCo <- FMR_RuCo %>%
  arrange(ZIP_Average)

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4	ZIP_Average
Rutherford FMR, by size and ZIP
37118	1100	1130	1270	1590	1960	1410
37149	1100	1130	1270	1590	1960	1410
37130	1180	1210	1360	1710	2080	1508
37132	1180	1210	1360	1710	2080	1508
37127	1240	1270	1430	1800	2190	1586
37085	1260	1290	1450	1820	2210	1606
37167	1290	1330	1490	1870	2280	1652
37153	1410	1450	1630	2040	2490	1804
37129	1420	1460	1640	2060	2510	1818
37128	1510	1550	1740	2190	2670	1932
37086	1580	1620	1820	2290	2790	2020
37037	1660	1710	1920	2410	2940	2128

Sorting, high to low

See the advantage? Now it’s easy to tell which ZIP codes have the lowest and highest average rents. Furthermore, it’s easy to tell how any one ZIP code’s average rent compares to the average rents for all of the other ZIP codes.

What if you wanted to sort the table in the other direction - that is, with the highest ZIP_Average figure at the top and the lowest at the bottom? You could do so by situating the dplyr package’s desc() function inside the arrange() function, like this:

# Sorting, descending order

FMR_RuCo <- FMR_RuCo %>%
  arrange(desc(ZIP_Average))

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4	ZIP_Average
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940	2128
37086	1580	1620	1820	2290	2790	2020
37128	1510	1550	1740	2190	2670	1932
37129	1420	1460	1640	2060	2510	1818
37153	1410	1450	1630	2040	2490	1804
37167	1290	1330	1490	1870	2280	1652
37085	1260	1290	1450	1820	2210	1606
37127	1240	1270	1430	1800	2190	1586
37130	1180	1210	1360	1710	2080	1508
37132	1180	1210	1360	1710	2080	1508
37118	1100	1130	1270	1590	1960	1410
37149	1100	1130	1270	1590	1960	1410

Recoding

In data analysis, “recoding” means creating a variable based on the values of one or more other variables. To illustrate, let’s make a variable called Rent_Category that indicates whether each ZIP code’s ZIP_Average is above, equal to, or below the average of all the ZIP_Average values.

Finding the average. The first task is to find the average of the ZIP_Average values. The mean() function, from base R, offers a handy shortcut. The code finds the average, stores it in a variable called Average_ZIP_Average, and displays the average’s value:

# Finding the average of the ZIP_Average values

Average_ZIP_Average <- mean(FMR_RuCo$ZIP_Average)
Average_ZIP_Average

## [1] 1698.5

Weird syntax. We’re working in base R, not dplyr, here, so the syntax is a little different than what you have seen so far. Here, the FMR_RuCo$ZIP_Average$ code let’s R know that the ZIP_Average variable is a column in the FMR_RuCo data frame. The $ separates the data frame name from the variable name.

Getting logical. With the average, 1698.5, stored in Average_ZIP_Average, we can use the case_when() function from the dplyr package to add the Rent_Category variable to the FMR_RuCo data frame and make the variable’s values depend on how each ZIP_Average value compares logically to the Average_ZIP_Average value.

By “logically,” I mean that we’re going to account for all possible ways that Zip_Average could compare to Average_ZIP_Average.

The next block of code will:

Set Rent_Category to “Above average” if Zip_Average is greater than Average_ZIP_Average.
Set Rent_Category to “Average” if Zip_Average equals Average_ZIP_Average. You can tell that’s not going to happen, because Average_ZIP_Average ends in .5, and all of the Zip_Average values are whole numbers with no decimals. But it’s wise to cover all of the logical bases. When you’re dealing with a ton of data rather than just a dozen rows, you never know what might be hiding in there.
Set Rent_Category to “Below average” if Zip_Average is less than Average_ZIP_Average.
Sent Rent_Category to “Error” if, for some reason, a Zip_Average value fits none of the above logical patterns. Here, again, we can tell at a glance that none of the dozen Zip_Average values will fall into this category. But if we were working with hundreds or thousands of Zip_Average values, at least a few might. For example, there could be some “NA” values, which is R’s default for indicating missing values. It’s simply good practice to cover all of the logical possibilities, and the “Error” category would let you easily find and deal with uncategorized Zip_Average values.

# Recoding 

FMR_RuCo <- FMR_RuCo %>%
  mutate(
    Rent_Category = case_when(
      ZIP_Average > Average_ZIP_Average ~ "Above average",
      ZIP_Average == Average_ZIP_Average ~ "Average",
      ZIP_Average < Average_ZIP_Average ~ "Below average",
      .default = "Error"))

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4	ZIP_Average	Rent_Category
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940	2128	Above average
37086	1580	1620	1820	2290	2790	2020	Above average
37128	1510	1550	1740	2190	2670	1932	Above average
37129	1420	1460	1640	2060	2510	1818	Above average
37153	1410	1450	1630	2040	2490	1804	Above average
37167	1290	1330	1490	1870	2280	1652	Below average
37085	1260	1290	1450	1820	2210	1606	Below average
37127	1240	1270	1430	1800	2190	1586	Below average
37130	1180	1210	1360	1710	2080	1508	Below average
37132	1180	1210	1360	1710	2080	1508	Below average
37118	1100	1130	1270	1590	1960	1410	Below average
37149	1100	1130	1270	1590	1960	1410	Below average

Grouping and summarizing

Now that we have the ZIP codes simplified into ZIP codes with overall “Above average” and “Below average” rents, let’s put these categories to work.

Suppose you and two friends were hunting for a three-bedroom apartment to share. Suppose, also, that each of you had agreed to chip in $600 on each month’s rent, making your total budget 3 x $600 = $1,800 a month.

With only a dozen ZIP codes to look at in the table above, you and your friends could easily see that you probably should begin your search in the 37127 ZIP code, and maybe consider the 37130 and 37132 ZIP codes as well. But if you were searching among the 80 ZIP codes throughout Rutherford, Davidson and Williamson counties, you might appreciate a more efficient way to narrow your search.

Here’s a technique for doing that. Let’s pull the three-bedroom data into a separate data frame called Summary_BR3 and get the average, maximum and minimum three-bedroom rent for the “Above average” and “Below average” ZIP codes, along with the Count of how many ZIP codes are in each category.

Looking at those results would let you know whether you should begin your search among the “Above average” ZIP codes or the “Below average” ZIP codes - or whether your budget is too small for either category or big enough to let you shop in both categories.

# Grouping and summarizing

Summary_BR3 <- FMR_RuCo %>% 
  group_by(Rent_Category) %>% 
  summarize(Average = round(mean(BR3), digits = 0),
            Minimum = min(BR3),
            Maximum = max(BR3),
            Count = n())

Summary_BR3_table <- gt(Summary_BR3) %>% 
  tab_header("Three-bedroom stats, by rent category") %>%
  cols_align(align = "left") %>%
  gt_theme_538
Summary_BR3_table

Rent_Category	Average	Minimum	Maximum	Count
Three-bedroom stats, by rent category
Above average	2198	2040	2410	5
Below average	1727	1590	1870	7

A glance at the summary table tells you that you and your two friends likely will be apartment hunting exclusively in the seven “Below average” rent ZIP codes, where average rents range from $1,590 to $1,870 and average $1,727. You’ll probably have fewer choices in the ZIP code with the $1,870 average rent. But the figure is an average, which means at least some of the available apartments should go for less than $1,870.

Meanwhile, it looks like all, or at least most, of the five “Above average” ZIP codes will be out of your price range. Even the lowest average, $2,198, is well above your budget. Again, the figure is an average. You might get lucky and find an $1,800 apartment in that ZIP code. But it would be a long shot.

Of course, you’ll still have to look at the full table to get the actual ZIP codes that fit your budget. But the summary table can let you quickly narrow your search.

Saving your data

You might want to work with this data again at a later time. If so, you could simply rerun the code. Doing so would redownload the raw data and quickly repeat all of the work you have done. But all the same, you might want to save the analyzed data to your computer’s hard drive so you can simply load it from there.

Here’s how you would save the data to your computer using the write_csv() function from the readr package. The readr package is part of the tidyverse package, so you loaded readr into memory when you ran the library(tidyverse) code way back at the start of the script.

The code will save the FMR_RuCo data frame to your hard drive, in the same folder as this script. The file will be called FMR_RuCo.csv.” You’ll see the file show up under the “Files” tab on the right side of the screen.

You could name the file anything you like, but using a name that repeats the data frame name can make it easier to remember what’s what later on.

The .csv at the end of the file name signals that the file is in comma-separated value format. That means that, in the file, commas indicate where the columns begin and end. You could save the file in other formats, but .csv format is nearly universal.

write_csv(FMR_RuCo, "FMR_RuCo.csv")

Reading your data back into an R data frame

It’s easy to reverse the process and read your .csv file back into an R data frame. Just to illustrate how it’s done, here is code that will read the FMR_RuCo.csv file into an R data frame called Retrieved_Data:

Retrieved_Data <- read_csv("FMR_RuCo.csv")