Mutating?

If you want to have a little fun with your friends, try casually mentioning that you learned how to mutate today. Let me know how it goes.

Fortunately (or unfortunately, depending on your attitude about mutation), we won’t be learning today about biological mutation. If you want to fly like Superman or conjure like Scarlet Witch, I can’t help you. All I can do is help you learn about data mutation. It’s not nearly as cool. But it’s still pretty cool.

An (ethical) ChatGPT example

In data science, mutation means creating a new column of data based on one or more pre-existing columns of data. The sports fans in the class might enjoy an example from the athletic world.

Even people (like me) who don’t follow sports closely probably know that basketball players score points by throwing a basketball at a netted hoop, called a goal, that has been positioned horizontally 10 feet off the floor. If the ball falls through the hoop, that’s called a field goal. This is true even though basketball is played on a court, not a field, and American college and pro football players - who play on an actual field - score a field goal by kicking an egg-shaped ball through a goal formed by two vertical posts positioned 18 feet and 6 inches apart. But, I digress.

A basketball field goal is worth one, two, or three points, depending on the circumstances. Not even the best players score a field goal every time they try. As a result, a player’s field goal percentage is an important statistic. It’s the number of successful field goals per every 100 field goal attempts. The larger the field goal percentage, the better the player, assuming the percentage is based on a representatively large number of attempts.

Here’s an example, adapted from the results of a ChatGPT prompt, of R code that uses the mutate() function to calculate the field goal percentages for five made-up basketball players:

# Note: Adapted from R code generated by the ChatGPT
# prompt: "Write an R script that demonstrates how to
# use the mutate() function to calculate field goal
# percentage for each member on a team of basketball
# players."

# Load necessary package
if (!require("tidyverse"))
  install.packages("tidyverse")
library(tidyverse)

# Create a made-up data frame of attempted (FGA) and
# successful (FGM) free throw shots for five basketball # team members

basketball_team <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  FGM = c(150, 180, 200, 170, 190),  # Field Goals Made
  FGA = c(300, 400, 450, 380, 420)   # Field Goals Attempted
)

# Print the data frame

print(basketball_team)
##     Player FGM FGA
## 1 Player A 150 300
## 2 Player B 180 400
## 3 Player C 200 450
## 4 Player D 170 380
## 5 Player E 190 420
# Use mutate() to calculate Field Goal Percentage (FG%)

basketball_team <- basketball_team %>%
  mutate(FG_Percentage = (FGM / FGA) * 100)

# Print the updated data frame

print(basketball_team)
##     Player FGM FGA FG_Percentage
## 1 Player A 150 300      50.00000
## 2 Player B 180 400      45.00000
## 3 Player C 200 450      44.44444
## 4 Player D 170 380      44.73684
## 5 Player E 190 420      45.23810

Explaining the code

There’s all sorts of magic happening, here, including use of the data.frame() function to generate a data frame on the fly, the print() function to give you a quick look at the data frame created, and the %>% pipe operator to string the functions together efficently. And, of course, there’s the phenomenon of getting ChatGPT to write most of the code, which is not cheating if you cite the assist. But I’m focusing on what gets done by the line:

mutate(FG_Percentage = (FGM / FGA) * 100)

Here, the mutate() function creates a new column in the data frame, FG_Percentage, by dividing each player’s FGM by the player’s FGA, then multiplying by 100. The / signals division, and the * means multiplication. As you might remember from math class, the order in which you do math can matter. The parentheses in the (FGM / FGA) part tell R to do the division first, then do the multiplication. The line’s other ( and ) tell R to treat the calculations as the mutate() function’s argument.

The preceding line:

basketball_team <- basketball_team %<%

… tells R to overwrite the original basketball_team data frame with the new version that includes the FG_Percentage column.

Rounding

The default number of decimal places on a mutate() result isn’t always what you want. We don’t really need five decimal places worth of precision to get a reasonable idea of a player’s field goal percentage. Being told that Player A’s percentage is 50.00000 tells you nearly nothing you can’t understand by being told that Player A’s percentage is 50.0. One decimal place is plenty.

Fortunately, incorporating a round() function into a mutate() function will get R to round off the values the mutate() function produces:

# Use mutate() to calculate Field Goal Percentage (FG%) and round() to round off the result

basketball_team <- basketball_team %>%
  mutate(FG_Percentage = round((FGM / FGA) * 100,1))

# Print the updated data frame

print(basketball_team)
##     Player FGM FGA FG_Percentage
## 1 Player A 150 300          50.0
## 2 Player B 180 400          45.0
## 3 Player C 200 450          44.4
## 4 Player D 170 380          44.7
## 5 Player E 190 420          45.2

See? Much neater, with no substantial information loss.

The tricky part involves getting the functions, their arguments, and all their required parentheses and commas in the right order. It can drive you batty if you’re not careful. It helps to understand that round() has two requiredarguments: the value you want to round, and the number of digits you want to round to. Like most function arguments, they have to be surrounded by parentheses and separated by a comma. So, going left to right:

Putting one function inside another like this is called nesting the functions. Lots of other data analysis tools can handle nested functions, including Microsoft Excel.

Using mutate() on the rent data

Let’s reload the rent data and see whether we can get mutate() to do something useful to the information. Specifically, let’s look at using mutate() to average the rent estimates for each ZIP code and put the averages in a new column called ZIP_Average. Doing so would give us a convenient, single figure, the average, that we could use to compare the ZIP codes in terms of their overall rental costs, regardless of rental unit size.

Reloading the rent data

This code, which you’ve worked with already, will load and display the 2024 data:

# Installing and loading required packages

if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("gtExtras"))
  install.packages("gtExtras")

library(tidyverse)
library(gtExtras)
library(readxl)

# Downloading data from:
# https://www.huduser.gov/portal/datasets/fmr/smallarea/index.html#year2024

download.file("https://www.huduser.gov/portal/datasets/fmr/fmr2024/fy2024_safmrs_revised.xlsx", "rent.xlsx", mode = "wb")

# Reading the downloaded Excel file into a data frame called FMR

FMR <- read_xlsx(path = "rent.xlsx", .name_repair = "universal") 

# Making a list of Rutherford County ZIP codes

ZIPList <- c(
  "37127",
  "37128",
  "37129",
  "37130",
  "37132",
  "37085",
  "37118",
  "37149",
  "37037",
  "37153",
  "37167",
  "37086"
)

# Filtering for Rutherford ZIP codes and
# selecting columns of interest

FMR_RuCo <- FMR %>% 
  filter(ZIP.Code %in% ZIPList) %>% 
  select(ZIP.Code,
         SAFMR.0BR,
         SAFMR.1BR,
         SAFMR.2BR,
         SAFMR.3BR,
         SAFMR.4BR) %>% 
  distinct()

# Renaming the columns

colnames(FMR_RuCo) <- c("ZIP",
                          "Studio",
                          "BR1",
                          "BR2",
                          "BR3",
                          "BR4")

# Showing the data as a table

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP
ZIP Studio BR1 BR2 BR3 BR4
37037 1660 1710 1920 2410 2940
37085 1260 1290 1450 1820 2210
37086 1580 1620 1820 2290 2790
37118 1100 1130 1270 1590 1960
37127 1240 1270 1430 1800 2190
37128 1510 1550 1740 2190 2670
37129 1420 1460 1640 2060 2510
37130 1180 1210 1360 1710 2080
37132 1180 1210 1360 1710 2080
37149 1100 1130 1270 1590 1960
37153 1410 1450 1630 2040 2490
37167 1290 1330 1490 1870 2280

Averaging the rents

Adding this code will calculate the average rent for each ZIP code, round the average off to the nearest whole dollar - like the rest of the rent estimates - put the averages in a new column on the right side of the data frame, and show you a new table that includes a column for the averages:

FMR_RuCo <- FMR_RuCo %>%
  mutate(ZIP_Average = round((Studio + BR1 + BR2 + BR3 + BR4) / 5,0))

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP
ZIP Studio BR1 BR2 BR3 BR4 ZIP_Average
37037 1660 1710 1920 2410 2940 2128
37085 1260 1290 1450 1820 2210 1606
37086 1580 1620 1820 2290 2790 2020
37118 1100 1130 1270 1590 1960 1410
37127 1240 1270 1430 1800 2190 1586
37128 1510 1550 1740 2190 2670 1932
37129 1420 1460 1640 2060 2510 1818
37130 1180 1210 1360 1710 2080 1508
37132 1180 1210 1360 1710 2080 1508
37149 1100 1130 1270 1590 1960 1410
37153 1410 1450 1630 2040 2490 1804
37167 1290 1330 1490 1870 2280 1652

See the pattern?

The code works just like the example’s code for calculating field goal percentages:

mutate(ZIP_Average = round((Studio + BR1 + BR2 + BR3 + BR4) / 5,0))

As before, the round() function and its two arguments are nested within the mutate() function and its argument. Only two things have changed:

This is, by the way, about as complex as the math involved in this course will get. If you’re not “mathed out,” you never will be.

See the added value?

Perhaps more importantly, we have made the data more informative. Now, regardless of which particular rental unit size you are interested in, you can quickly get an overall idea of how expensive a given ZIP code is to the others. For example, the 37037 and 37086 ZIP codes stand out as conspicuously expensive, being the only two that exceed $2,000 a month.

In-class exercise

Privacy is nice, but expensive. It doesn’t take a mathematical genius to figure out that four people could rent in the most expensive ZIP code, split the rent four ways, and pay less per person than what it would cost one person to rent a single-bedroom apartment in the cheapest ZIP code.

To practice using the mutate() and round() functions, try adding a column called Split4 that divides each ZIP code’s four-bedroom rent by four and rounds the result to one decimal places, given that, in some cases, the four roommates will end up paying a whole dollar amount plus part of a dollar. Then, run the table code again to show the updated information.

Pro tip: Ctrl / Shift / M (on a PC) or Cmd / Shift / M (on a Mac) will autotype the %<% pipe operator.

You should end up with a table that looks like this:

Rutherford FMR, by size and ZIP
ZIP Studio BR1 BR2 BR3 BR4 ZIP_Average Split4
37037 1660 1710 1920 2410 2940 2128 735.0
37085 1260 1290 1450 1820 2210 1606 552.5
37086 1580 1620 1820 2290 2790 2020 697.5
37118 1100 1130 1270 1590 1960 1410 490.0
37127 1240 1270 1430 1800 2190 1586 547.5
37128 1510 1550 1740 2190 2670 1932 667.5
37129 1420 1460 1640 2060 2510 1818 627.5
37130 1180 1210 1360 1710 2080 1508 520.0
37132 1180 1210 1360 1710 2080 1508 520.0
37149 1100 1130 1270 1590 1960 1410 490.0
37153 1410 1450 1630 2040 2490 1804 622.5
37167 1290 1330 1490 1870 2280 1652 570.0