In an earlier lesson, we looked at how to extract ZIP-code-level Small-Area Fair Market Rent data from the U.S. Department of Housing and Urban Development, filter it for selected Rutherford County ZIP codes, and display the data in a table. Below is the code from that lesson, followed by the table the code produces.
Rerun the code. Doing so will ensure that the rent data will be available to to R on the computer you are using. Then we’ll look at some additional things we can do with the data using some of R’s data wrangling tools.
# Installing and loading required packages
if (!require("tidyverse"))
install.packages("tidyverse")
if (!require("openxlsx"))
install.packages("openxlsx")
if (!require("gtExtras"))
install.packages("gtExtras")
library(tidyverse)
library(openxlsx)
library(gtExtras)
# Reading data from:
# https://www.huduser.gov/portal/datasets/fmr/smallarea/index.html#year2024
FMR <- read.xlsx(
"https://www.huduser.gov/portal/datasets/fmr/fmr2024/fy2024_safmrs_revised.xlsx",
sheet = 1)
# Making a list of Rutherford County ZIP codes
ZIPList <- c(
"37127",
"37128",
"37129",
"37130",
"37132",
"37085",
"37118",
"37149",
"37037",
"37153",
"37167",
"37086")
# Filtering for Rutherford ZIP codes and
# selecting columns of interest
FMR_RuCo <- FMR %>%
filter(ZIP.Code %in% ZIPList) %>%
select(ZIP.Code, SAFMR.0BR, SAFMR.1BR, SAFMR.2BR, SAFMR.3BR, SAFMR.4BR) %>%
distinct()
# Renaming the columns
colnames(FMR_RuCo) <- c("ZIP", "Studio", "BR1", "BR2", "BR3", "BR4")
# Showing the data as a table
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | |||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 |
---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 |
Chances are, you already know how to produce an average. You just add
the numbers you want to average, then divide by however many numbers
there are, right? Easy. The mutate()
function can average each ZIP
code’s rents, then put the average in a new column we’ll call
“ZIP_Average.”
The mutate()
function is part of the dplyr package,
which R loaded into memory when it ran the
library(tidyverse)
line in the code above. The dplyr
package is part of the tidyverse package. Note that
mutate()
won’t work unless library(tidyverse)
has been run sometime since you opened RStudio, and
library(tidyverse)
won’t work unless
install.packages("tidyverse")
has been run at at least one
time on the computer you are using.
That’s a general principle to keep in mind. If a function is part of
base R, then all you have to do to use the function is to open R Studio.
But if, like mutate()
, the function is part of a package,
then you have to install the package once per computer, using
install.packages()
and load the package once per computing
session, using library()
, before you can use the function.
It sounds like a pain, but it’s really not. In practice, you simply
begin a script by installing, if needed, and loading all the packages
required by the script. That’s what’s going on above, in the section of
code marked # Installing and loading required packages
. And
there are just a handful of packages you’ll use routinely.
This code will use mutate()
to calculate the average.
Then, the code will regenerate the table with the new variable included.
The table is shown below.
FMR_RuCo <- FMR_RuCo %>%
mutate(ZIP_Average = (Studio + BR1 + BR2 + BR3 + BR4) / 5)
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | ||||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 | ZIP_Average |
---|---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 | 2128 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 | 1606 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 | 2020 |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 | 1586 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 | 1932 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 | 1818 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 | 1804 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 | 1652 |
It’s not difficult to scan through the dozen average rents in the table and discern that two ZIP codes, 37118 and 37149, tie for the cheapest average, $1,410, while the 37037 ZIP code has the most expensive average, $2,128. Identifying these extremes would be a lot more difficult, though, if the table contained (as it will in future lessons in this series) all 80 ZIP codes across Rutherford, Davidson and Williamson counties.
So it would be handy to know how to arrange the ZIP codes by their
ZIP_Average values, so that the ZIP_Average values get bigger as you go
from the top of the table to the bottom of the table. This code will do
exactly that, using the arrange()
function from the dplyr
package, then regenerate the table:
# Sorting, ascending order
FMR_RuCo <- FMR_RuCo %>%
arrange(ZIP_Average)
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | ||||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 | ZIP_Average |
---|---|---|---|---|---|---|
37118 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 | 1586 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 | 1606 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 | 1652 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 | 1804 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 | 1818 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 | 1932 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 | 2020 |
37037 | 1660 | 1710 | 1920 | 2410 | 2940 | 2128 |
See the advantage? Now it’s easy to tell which ZIP codes have the lowest and highest average rents. Furthermore, it’s easy to tell how any one ZIP code’s average rent compares to the average rents for all of the other ZIP codes.
What if you wanted to sort the table in the other direction - that is, with the highest ZIP_Average figure at the top and the lowest at the bottom? You could do so by situating the dplyr package’s desc() function inside the arrange() function, like this:
# Sorting, descending order
FMR_RuCo <- FMR_RuCo %>%
arrange(desc(ZIP_Average))
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | ||||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 | ZIP_Average |
---|---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 | 2128 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 | 2020 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 | 1932 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 | 1818 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 | 1804 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 | 1652 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 | 1606 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 | 1586 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
In data analysis, “recoding” means creating a variable based on the values of one or more other variables. To illustrate, let’s make a variable called Rent_Category that indicates whether each ZIP code’s ZIP_Average is above, equal to, or below the average of all the ZIP_Average values.
Finding the average. The first task is to find the
average of the ZIP_Average values. The mean()
function, from base R, offers a
handy shortcut. The code finds the average, stores it in a variable
called Average_ZIP_Average, and displays the average’s value:
# Finding the average of the ZIP_Average values
Average_ZIP_Average <- mean(FMR_RuCo$ZIP_Average)
Average_ZIP_Average
## [1] 1698.5
Weird syntax. We’re working in base R, not dplyr,
here, so the syntax is a little different than what you have seen so
far. Here, the FMR_RuCo$ZIP_Average$
code let’s R know that
the ZIP_Average
variable is a column in the
FMR_RuCo
data frame. The $
separates the data
frame name from the variable name.
Getting logical. With the average, 1698.5, stored in Average_ZIP_Average, we can use the case_when() function from the dplyr package to add the Rent_Category variable to the FMR_RuCo data frame and make the variable’s values depend on how each ZIP_Average value compares logically to the Average_ZIP_Average value.
By “logically,” I mean that we’re going to account for all possible ways that Zip_Average could compare to Average_ZIP_Average.
The next block of code will:
Set Rent_Category to “Above average” if Zip_Average is greater than Average_ZIP_Average.
Set Rent_Category to “Average” if Zip_Average equals Average_ZIP_Average. You can tell that’s not going to happen, because Average_ZIP_Average ends in .5, and all of the Zip_Average values are whole numbers with no decimals. But it’s wise to cover all of the logical bases. When you’re dealing with a ton of data rather than just a dozen rows, you never know what might be hiding in there.
Set Rent_Category to “Below average” if Zip_Average is less than Average_ZIP_Average.
Sent Rent_Category to “Error” if, for some reason, a Zip_Average value fits none of the above logical patterns. Here, again, we can tell at a glance that none of the dozen Zip_Average values will fall into this category. But if we were working with hundreds or thousands of Zip_Average values, at least a few might. For example, there could be some “NA” values, which is R’s default for indicating missing values. It’s simply good practice to cover all of the logical possibilities, and the “Error” category would let you easily find and deal with uncategorized Zip_Average values.
# Recoding
FMR_RuCo <- FMR_RuCo %>%
mutate(
Rent_Category = case_when(
ZIP_Average > Average_ZIP_Average ~ "Above average",
ZIP_Average == Average_ZIP_Average ~ "Average",
ZIP_Average < Average_ZIP_Average ~ "Below average",
.default = "Error"))
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | |||||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 | ZIP_Average | Rent_Category |
---|---|---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 | 2128 | Above average |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 | 2020 | Above average |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 | 1932 | Above average |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 | 1818 | Above average |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 | 1804 | Above average |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 | 1652 | Below average |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 | 1606 | Below average |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 | 1586 | Below average |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 | Below average |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 | Below average |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 | Below average |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 | Below average |
Now that we have the ZIP codes simplified into ZIP codes with overall “Above average” and “Below average” rents, let’s put these categories to work.
Suppose you and two friends were hunting for a three-bedroom apartment to share. Suppose, also, that each of you had agreed to chip in $600 on each month’s rent, making your total budget 3 x $600 = $1,800 a month.
With only a dozen ZIP codes to look at in the table above, you and your friends could easily see that you probably should begin your search in the 37127 ZIP code, and maybe consider the 37130 and 37132 ZIP codes as well. But if you were searching among the 80 ZIP codes throughout Rutherford, Davidson and Williamson counties, you might appreciate a more efficient way to narrow your search.
Here’s a technique for doing that. Let’s pull the three-bedroom data into a separate data frame called Summary_BR3 and get the average, maximum and minimum three-bedroom rent for the “Above average” and “Below average” ZIP codes, along with the Count of how many ZIP codes are in each category.
Looking at those results would let you know whether you should begin your search among the “Above average” ZIP codes or the “Below average” ZIP codes - or whether your budget is too small for either category or big enough to let you shop in both categories.
# Grouping and summarizing
Summary_BR3 <- FMR_RuCo %>%
group_by(Rent_Category) %>%
summarize(Average = round(mean(BR3), digits = 0),
Minimum = min(BR3),
Maximum = max(BR3),
Count = n())
Summary_BR3_table <- gt(Summary_BR3) %>%
tab_header("Three-bedroom stats, by rent category") %>%
cols_align(align = "left") %>%
gt_theme_538
Summary_BR3_table
Three-bedroom stats, by rent category | ||||
Rent_Category | Average | Minimum | Maximum | Count |
---|---|---|---|---|
Above average | 2198 | 2040 | 2410 | 5 |
Below average | 1727 | 1590 | 1870 | 7 |
A glance at the summary table tells you that you and your two friends likely will be apartment hunting exclusively in the seven “Below average” rent ZIP codes, where average rents range from $1,590 to $1,870 and average $1,727. You’ll probably have fewer choices in the ZIP code with the $1,870 average rent. But the figure is an average, which means at least some of the available apartments should go for less than $1,870.
Meanwhile, it looks like all, or at least most, of the five “Above average” ZIP codes will be out of your price range. Even the lowest average, $2,198, is well above your budget. Again, the figure is an average. You might get lucky and find an $1,800 apartment in that ZIP code. But it would be a long shot.
Of course, you’ll still have to look at the full table to get the actual ZIP codes that fit your budget. But the summary table can let you quickly narrow your search.
You might want to work with this data again at a later time. If so, you could simply rerun the code. Doing so would redownload the raw data and quickly repeat all of the work you have done. But all the same, you might want to save the analyzed data to your computer’s hard drive so you can simply load it from there.
Here’s how you would save the data to your computer using the write_csv() function from the readr package.
The readr package is part of the tidyverse package, so you loaded readr
into memory when you ran the library(tidyverse)
code way
back at the start of the script.
The code will save the FMR_RuCo data frame to your hard drive, in the same folder as this script. The file will be called FMR_RuCo.csv.” You’ll see the file show up under the “Files” tab on the right side of the screen.
You could name the file anything you like, but using a name that repeats the data frame name can make it easier to remember what’s what later on.
The .csv at the end of the file name signals that the file is in comma-separated value format. That means that, in the file, commas indicate where the columns begin and end. You could save the file in other formats, but .csv format is nearly universal.
write_csv(FMR_RuCo, "FMR_RuCo.csv")
It’s easy to reverse the process and read your .csv file back into an R data frame. Just to illustrate how it’s done, here is code that will read the FMR_RuCo.csv file into an R data frame called Retrieved_Data:
Retrieved_Data <- read_csv("FMR_RuCo.csv")