Mutating

Mutating?

If you want to have a little fun with your friends, try casually mentioning that you learned how to mutate today. Let me know how it goes.

Fortunately (or unfortunately, depending on your attitude about mutation), we won’t be learning today about biological mutation. If you want to fly like Superman or conjure like Scarlet Witch, I can’t help you. All I can do is help you learn about data mutation. It’s not nearly as cool. But it’s still pretty cool.

An (ethical) ChatGPT example

In data science, mutation means creating a new column of data based on one or more pre-existing columns of data. The sports fans in the class might enjoy an example from the athletic world.

Even people (like me) who don’t follow sports closely probably know that basketball players score points by throwing a basketball at a netted hoop, called a goal, that has been positioned horizontally 10 feet off the floor. If the ball falls through the hoop, that’s called a field goal. This is true even though basketball is played on a court, not a field, and American college and pro football players - who play on an actual field - score a field goal by kicking an egg-shaped ball through a goal formed by two vertical posts positioned 18 feet and 6 inches apart. But, I digress.

A basketball field goal is worth one, two, or three points, depending on the circumstances. Not even the best players score a field goal every time they try. As a result, a player’s field goal percentage is an important statistic. It’s the number of successful field goals per every 100 field goal attempts. The larger the field goal percentage, the better the player, assuming the percentage is based on a representatively large number of attempts.

Here’s an example, adapted from the results of a ChatGPT prompt, of R code that uses the mutate() function to calculate the field goal percentages for five made-up basketball players:

# Note: Adapted from R code generated by the ChatGPT
# prompt: "Write an R script that demonstrates how to
# use the mutate() function to calculate field goal
# percentage for each member on a team of basketball
# players."

# Load necessary package
if (!require("tidyverse"))
  install.packages("tidyverse")
library(tidyverse)

# Create a made-up data frame of attempted (FGA) and
# successful (FGM) free throw shots for five basketball # team members

basketball_team <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  FGM = c(150, 180, 200, 170, 190),  # Field Goals Made
  FGA = c(300, 400, 450, 380, 420)   # Field Goals Attempted
)

# Print the data frame

print(basketball_team)

##     Player FGM FGA
## 1 Player A 150 300
## 2 Player B 180 400
## 3 Player C 200 450
## 4 Player D 170 380
## 5 Player E 190 420

# Use mutate() to calculate Field Goal Percentage (FG%)

basketball_team <- basketball_team %>%
  mutate(FG_Percentage = (FGM / FGA) * 100)

# Print the updated data frame

print(basketball_team)

##     Player FGM FGA FG_Percentage
## 1 Player A 150 300      50.00000
## 2 Player B 180 400      45.00000
## 3 Player C 200 450      44.44444
## 4 Player D 170 380      44.73684
## 5 Player E 190 420      45.23810

Explaining the code

There’s all sorts of magic happening, here, including use of the data.frame() function to generate a data frame on the fly, the print() function to give you a quick look at the data frame created, and the %>% pipe operator to string the functions together efficently. And, of course, there’s the phenomenon of getting ChatGPT to write most of the code, which is not cheating if you cite the assist. But I’m focusing on what gets done by the line:

mutate(FG_Percentage = (FGM / FGA) * 100)

Here, the mutate() function creates a new column in the data frame, FG_Percentage, by dividing each player’s FGM by the player’s FGA, then multiplying by 100. The / signals division, and the * means multiplication. As you might remember from math class, the order in which you do math can matter. The parentheses in the (FGM / FGA) part tell R to do the division first, then do the multiplication. The line’s other ( and ) tell R to treat the calculations as the mutate() function’s argument.

The preceding line:

basketball_team <- basketball_team %<%

… tells R to overwrite the original basketball_team data frame with the new version that includes the FG_Percentage column.

Rounding

The default number of decimal places on a mutate() result isn’t always what you want. We don’t really need five decimal places worth of precision to get a reasonable idea of a player’s field goal percentage. Being told that Player A’s percentage is 50.00000 tells you nearly nothing you can’t understand by being told that Player A’s percentage is 50.0. One decimal place is plenty.

Fortunately, incorporating a round() function into a mutate() function will get R to round off the values the mutate() function produces:

# Use mutate() to calculate Field Goal Percentage (FG%) and round() to round off the result

basketball_team <- basketball_team %>%
  mutate(FG_Percentage = round((FGM / FGA) * 100,1))

# Print the updated data frame

print(basketball_team)

##     Player FGM FGA FG_Percentage
## 1 Player A 150 300          50.0
## 2 Player B 180 400          45.0
## 3 Player C 200 450          44.4
## 4 Player D 170 380          44.7
## 5 Player E 190 420          45.2

See? Much neater, with no substantial information loss.

The tricky part involves getting the functions, their arguments, and all their required parentheses and commas in the right order. It can drive you batty if you’re not careful. It helps to understand that round() has two requiredarguments: the value you want to round, and the number of digits you want to round to. Like most function arguments, they have to be surrounded by parentheses and separated by a comma. So, going left to right:

mutate(FG_Percentage = gets things started by naming the mutate function, providing its (, the name of the variable being created, FG_Percentage, and the = sign.
Instead of finishing out the mutate() function as before, though, we introduce round and its (, followed by what we want to round, the (FGM / FGA) * 100 calculation. That’s the round() function’s first argument.
Next comes a , to separate round()’s first argument from its second, which is a 1, signaling that we want R to round the (FGM / FGA) * 100 calculation to one decimal place.
All we have left to do then is finish out both functions by providing the ) for the round() function as well as the ) for the mutate() function.

Putting one function inside another like this is called nesting the functions. Lots of other data analysis tools can handle nested functions, including Microsoft Excel.

Using `mutate()` on the rent data

Let’s reload the rent data and see whether we can get mutate() to do something useful to the information. Specifically, let’s look at using mutate() to average the rent estimates for each ZIP code and put the averages in a new column called ZIP_Average. Doing so would give us a convenient, single figure, the average, that we could use to compare the ZIP codes in terms of their overall rental costs, regardless of rental unit size.

Reloading the rent data

This code, which you’ve worked with already, will load and display the 2024 data:

# Installing and loading required packages

if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("gtExtras"))
  install.packages("gtExtras")

library(tidyverse)
library(gtExtras)
library(readxl)

# Downloading data from:
# https://www.huduser.gov/portal/datasets/fmr/smallarea/index.html#year2024

download.file("https://www.huduser.gov/portal/datasets/fmr/fmr2024/fy2024_safmrs_revised.xlsx", "rent.xlsx", mode = "wb")

# Reading the downloaded Excel file into a data frame called FMR

FMR <- read_xlsx(path = "rent.xlsx", .name_repair = "universal") 

# Making a list of Rutherford County ZIP codes

ZIPList <- c(
  "37127",
  "37128",
  "37129",
  "37130",
  "37132",
  "37085",
  "37118",
  "37149",
  "37037",
  "37153",
  "37167",
  "37086"
)

# Filtering for Rutherford ZIP codes and
# selecting columns of interest

FMR_RuCo <- FMR %>% 
  filter(ZIP.Code %in% ZIPList) %>% 
  select(ZIP.Code,
         SAFMR.0BR,
         SAFMR.1BR,
         SAFMR.2BR,
         SAFMR.3BR,
         SAFMR.4BR) %>% 
  distinct()

# Renaming the columns

colnames(FMR_RuCo) <- c("ZIP",
                          "Studio",
                          "BR1",
                          "BR2",
                          "BR3",
                          "BR4")

# Showing the data as a table

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940
37085	1260	1290	1450	1820	2210
37086	1580	1620	1820	2290	2790
37118	1100	1130	1270	1590	1960
37127	1240	1270	1430	1800	2190
37128	1510	1550	1740	2190	2670
37129	1420	1460	1640	2060	2510
37130	1180	1210	1360	1710	2080
37132	1180	1210	1360	1710	2080
37149	1100	1130	1270	1590	1960
37153	1410	1450	1630	2040	2490
37167	1290	1330	1490	1870	2280

Averaging the rents

Adding this code will calculate the average rent for each ZIP code, round the average off to the nearest whole dollar - like the rest of the rent estimates - put the averages in a new column on the right side of the data frame, and show you a new table that includes a column for the averages:

FMR_RuCo <- FMR_RuCo %>%
  mutate(ZIP_Average = round((Studio + BR1 + BR2 + BR3 + BR4) / 5,0))

FMR_RuCo_table <- gt(FMR_RuCo) %>% 
  tab_header("Rutherford FMR, by size and ZIP") %>%
  cols_align(align = "left") %>%
  gt_theme_538
FMR_RuCo_table

ZIP	Studio	BR1	BR2	BR3	BR4	ZIP_Average
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940	2128
37085	1260	1290	1450	1820	2210	1606
37086	1580	1620	1820	2290	2790	2020
37118	1100	1130	1270	1590	1960	1410
37127	1240	1270	1430	1800	2190	1586
37128	1510	1550	1740	2190	2670	1932
37129	1420	1460	1640	2060	2510	1818
37130	1180	1210	1360	1710	2080	1508
37132	1180	1210	1360	1710	2080	1508
37149	1100	1130	1270	1590	1960	1410
37153	1410	1450	1630	2040	2490	1804
37167	1290	1330	1490	1870	2280	1652

See the pattern?

The code works just like the example’s code for calculating field goal percentages:

mutate(ZIP_Average = round((Studio + BR1 + BR2 + BR3 + BR4) / 5,0))

As before, the round() function and its two arguments are nested within the mutate() function and its argument. Only two things have changed:

The calculation: (Studio + BR1 + BR2 + BR3 + BR4) / 5, meaning, “Add the rent estimates, then divide by 5, the number of estimates.”
The number of decimal places to round to: 0.

This is, by the way, about as complex as the math involved in this course will get. If you’re not “mathed out,” you never will be.

See the added value?

Perhaps more importantly, we have made the data more informative. Now, regardless of which particular rental unit size you are interested in, you can quickly get an overall idea of how expensive a given ZIP code is to the others. For example, the 37037 and 37086 ZIP codes stand out as conspicuously expensive, being the only two that exceed $2,000 a month.

In-class exercise

Privacy is nice, but expensive. It doesn’t take a mathematical genius to figure out that four people could rent in the most expensive ZIP code, split the rent four ways, and pay less per person than what it would cost one person to rent a single-bedroom apartment in the cheapest ZIP code.

To practice using the mutate() and round() functions, try adding a column called Split4 that divides each ZIP code’s four-bedroom rent by four and rounds the result to one decimal places, given that, in some cases, the four roommates will end up paying a whole dollar amount plus part of a dollar. Then, run the table code again to show the updated information.

Pro tip: Ctrl / Shift / M (on a PC) or Cmd / Shift / M (on a Mac) will autotype the %<% pipe operator.

You should end up with a table that looks like this:

ZIP	Studio	BR1	BR2	BR3	BR4	ZIP_Average	Split4
Rutherford FMR, by size and ZIP
37037	1660	1710	1920	2410	2940	2128	735.0
37085	1260	1290	1450	1820	2210	1606	552.5
37086	1580	1620	1820	2290	2790	2020	697.5
37118	1100	1130	1270	1590	1960	1410	490.0
37127	1240	1270	1430	1800	2190	1586	547.5
37128	1510	1550	1740	2190	2670	1932	667.5
37129	1420	1460	1640	2060	2510	1818	627.5
37130	1180	1210	1360	1710	2080	1508	520.0
37132	1180	1210	1360	1710	2080	1508	520.0
37149	1100	1130	1270	1590	1960	1410	490.0
37153	1410	1450	1630	2040	2490	1804	622.5
37167	1290	1330	1490	1870	2280	1652	570.0

Mutating

Ken Blake

2025-02-01

Mutating?

An (ethical) ChatGPT example

Using `mutate()` on the rent data

In-class exercise

Mutating

Ken Blake

2025-02-01

Mutating?

An (ethical) ChatGPT example

Using mutate() on the rent data

In-class exercise

Using `mutate()` on the rent data