If you want to have a little fun with your friends, try casually mentioning that you learned how to mutate today. Let me know how it goes.
Fortunately (or unfortunately, depending on your attitude about mutation), we won’t be learning today about biological mutation. If you want to fly like Superman or conjure like Scarlet Witch, I can’t help you. All I can do is help you learn about data mutation. It’s not nearly as cool. But it’s still pretty cool.
In data science, mutation means creating a new column of data based on one or more pre-existing columns of data. The sports fans in the class might enjoy an example from the athletic world.
Even people (like me) who don’t follow sports closely probably know that basketball players score points by throwing a basketball at a netted hoop, called a goal, that has been positioned horizontally 10 feet off the floor. If the ball falls through the hoop, that’s called a field goal. This is true even though basketball is played on a court, not a field, and American college and pro football players - who play on an actual field - score a field goal by kicking an egg-shaped ball through a goal formed by two vertical posts positioned 18 feet and 6 inches apart. But, I digress.
A basketball field goal is worth one, two, or three points, depending on the circumstances. Not even the best players score a field goal every time they try. As a result, a player’s field goal percentage is an important statistic. It’s the number of successful field goals per every 100 field goal attempts. The larger the field goal percentage, the better the player, assuming the percentage is based on a representatively large number of attempts.
Here’s an example, adapted from the results of a ChatGPT prompt, of R
code that uses the mutate()
function to calculate the
field goal percentages for five made-up basketball players:
# Note: Adapted from R code generated by the ChatGPT
# prompt: "Write an R script that demonstrates how to
# use the mutate() function to calculate field goal
# percentage for each member on a team of basketball
# players."
# Load necessary package
if (!require("tidyverse"))
install.packages("tidyverse")
library(tidyverse)
# Create a made-up data frame of attempted (FGA) and
# successful (FGM) free throw shots for five basketball # team members
basketball_team <- data.frame(
Player = c("Player A", "Player B", "Player C", "Player D", "Player E"),
FGM = c(150, 180, 200, 170, 190), # Field Goals Made
FGA = c(300, 400, 450, 380, 420) # Field Goals Attempted
)
# Print the data frame
print(basketball_team)
## Player FGM FGA
## 1 Player A 150 300
## 2 Player B 180 400
## 3 Player C 200 450
## 4 Player D 170 380
## 5 Player E 190 420
# Use mutate() to calculate Field Goal Percentage (FG%)
basketball_team <- basketball_team %>%
mutate(FG_Percentage = (FGM / FGA) * 100)
# Print the updated data frame
print(basketball_team)
## Player FGM FGA FG_Percentage
## 1 Player A 150 300 50.00000
## 2 Player B 180 400 45.00000
## 3 Player C 200 450 44.44444
## 4 Player D 170 380 44.73684
## 5 Player E 190 420 45.23810
Explaining the code
There’s all sorts of magic happening, here, including use of the data.frame()
function to generate a
data frame on the fly, the print()
function to give you a quick
look at the data frame created, and the %>%
pipe operator to string the
functions together efficently. And, of course, there’s the phenomenon of
getting ChatGPT to write most of the code, which is not cheating if you
cite the assist. But I’m focusing on what gets done by the line:
mutate(FG_Percentage = (FGM / FGA) * 100)
Here, the mutate()
function creates a new column
in the data frame, FG_Percentage
, by dividing each player’s
FGM
by the player’s FGA
, then multiplying by
100. The /
signals division, and the *
means
multiplication. As you might remember from math class, the order in
which you do math can matter. The parentheses in the
(FGM / FGA)
part tell R to do the division first, then do
the multiplication. The line’s other (
and )
tell R to treat the calculations as the mutate()
function’s
argument.
The preceding line:
basketball_team <- basketball_team %<%
… tells R to overwrite the original basketball_team
data
frame with the new version that includes the FG_Percentage
column.
Rounding
The default number of decimal places on a mutate()
result isn’t always what you want. We don’t really need five decimal
places worth of precision to get a reasonable idea of a player’s field
goal percentage. Being told that Player A’s percentage is 50.00000 tells
you nearly nothing you can’t understand by being told that Player A’s
percentage is 50.0. One decimal place is plenty.
Fortunately, incorporating a round()
function into a
mutate()
function will get R to round off the values the
mutate()
function produces:
# Use mutate() to calculate Field Goal Percentage (FG%) and round() to round off the result
basketball_team <- basketball_team %>%
mutate(FG_Percentage = round((FGM / FGA) * 100,1))
# Print the updated data frame
print(basketball_team)
## Player FGM FGA FG_Percentage
## 1 Player A 150 300 50.0
## 2 Player B 180 400 45.0
## 3 Player C 200 450 44.4
## 4 Player D 170 380 44.7
## 5 Player E 190 420 45.2
See? Much neater, with no substantial information loss.
The tricky part involves getting the functions, their arguments, and
all their required parentheses and commas in the right order. It can
drive you batty if you’re not careful. It helps to understand that
round()
has two requiredarguments: the value you want to
round, and the number of digits you want to round to. Like most function
arguments, they have to be surrounded by parentheses and separated by a
comma. So, going left to right:
mutate(FG_Percentage =
gets things started by naming
the mutate function, providing its (
, the name of the
variable being created, FG_Percentage
, and the
=
sign.
Instead of finishing out the mutate()
function as
before, though, we introduce round
and its (
,
followed by what we want to round, the (FGM / FGA) * 100
calculation. That’s the round()
function’s first
argument.
Next comes a ,
to separate round()
’s
first argument from its second, which is a 1
, signaling
that we want R to round the (FGM / FGA) * 100
calculation
to one decimal place.
All we have left to do then is finish out both functions by
providing the )
for the round()
function as
well as the )
for the mutate()
function.
Putting one function inside another like this is called nesting the functions. Lots of other data analysis tools can handle nested functions, including Microsoft Excel.
mutate()
on the rent dataLet’s reload the rent data and see whether we can get
mutate()
to do something useful to the information.
Specifically, let’s look at using mutate()
to
average the rent estimates for each ZIP code and put
the averages in a new column called ZIP_Average
. Doing so
would give us a convenient, single figure, the average, that we could
use to compare the ZIP codes in terms of their overall rental costs,
regardless of rental unit size.
Reloading the rent data
This code, which you’ve worked with already, will load and display the 2024 data:
# Installing and loading required packages
if (!require("tidyverse"))
install.packages("tidyverse")
if (!require("gtExtras"))
install.packages("gtExtras")
library(tidyverse)
library(gtExtras)
library(readxl)
# Downloading data from:
# https://www.huduser.gov/portal/datasets/fmr/smallarea/index.html#year2024
download.file("https://www.huduser.gov/portal/datasets/fmr/fmr2024/fy2024_safmrs_revised.xlsx", "rent.xlsx", mode = "wb")
# Reading the downloaded Excel file into a data frame called FMR
FMR <- read_xlsx(path = "rent.xlsx", .name_repair = "universal")
# Making a list of Rutherford County ZIP codes
ZIPList <- c(
"37127",
"37128",
"37129",
"37130",
"37132",
"37085",
"37118",
"37149",
"37037",
"37153",
"37167",
"37086"
)
# Filtering for Rutherford ZIP codes and
# selecting columns of interest
FMR_RuCo <- FMR %>%
filter(ZIP.Code %in% ZIPList) %>%
select(ZIP.Code,
SAFMR.0BR,
SAFMR.1BR,
SAFMR.2BR,
SAFMR.3BR,
SAFMR.4BR) %>%
distinct()
# Renaming the columns
colnames(FMR_RuCo) <- c("ZIP",
"Studio",
"BR1",
"BR2",
"BR3",
"BR4")
# Showing the data as a table
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | |||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 |
---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 |
Averaging the rents
Adding this code will calculate the average rent for each ZIP code, round the average off to the nearest whole dollar - like the rest of the rent estimates - put the averages in a new column on the right side of the data frame, and show you a new table that includes a column for the averages:
FMR_RuCo <- FMR_RuCo %>%
mutate(ZIP_Average = round((Studio + BR1 + BR2 + BR3 + BR4) / 5,0))
FMR_RuCo_table <- gt(FMR_RuCo) %>%
tab_header("Rutherford FMR, by size and ZIP") %>%
cols_align(align = "left") %>%
gt_theme_538
FMR_RuCo_table
Rutherford FMR, by size and ZIP | ||||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 | ZIP_Average |
---|---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 | 2128 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 | 1606 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 | 2020 |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 | 1586 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 | 1932 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 | 1818 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 | 1804 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 | 1652 |
See the pattern?
The code works just like the example’s code for calculating field goal percentages:
mutate(ZIP_Average = round((Studio + BR1 + BR2 + BR3 + BR4) / 5,0))
As before, the round() function and its two arguments are nested within the mutate() function and its argument. Only two things have changed:
The calculation:
(Studio + BR1 + BR2 + BR3 + BR4) / 5
, meaning, “Add the
rent estimates, then divide by 5, the number of estimates.”
The number of decimal places to round to:
0
.
This is, by the way, about as complex as the math involved in this course will get. If you’re not “mathed out,” you never will be.
See the added value?
Perhaps more importantly, we have made the data more informative. Now, regardless of which particular rental unit size you are interested in, you can quickly get an overall idea of how expensive a given ZIP code is to the others. For example, the 37037 and 37086 ZIP codes stand out as conspicuously expensive, being the only two that exceed $2,000 a month.
Privacy is nice, but expensive. It doesn’t take a mathematical genius to figure out that four people could rent in the most expensive ZIP code, split the rent four ways, and pay less per person than what it would cost one person to rent a single-bedroom apartment in the cheapest ZIP code.
To practice using the mutate()
and round()
functions, try adding a column called Split4
that divides
each ZIP code’s four-bedroom rent by four and rounds the result to one
decimal places, given that, in some cases, the four roommates will end
up paying a whole dollar amount plus part of a dollar. Then, run the
table code again to show the updated information.
Pro tip: Ctrl / Shift / M (on a PC) or Cmd /
Shift / M (on a Mac) will autotype the %<%
pipe
operator.
You should end up with a table that looks like this:
Rutherford FMR, by size and ZIP | |||||||
ZIP | Studio | BR1 | BR2 | BR3 | BR4 | ZIP_Average | Split4 |
---|---|---|---|---|---|---|---|
37037 | 1660 | 1710 | 1920 | 2410 | 2940 | 2128 | 735.0 |
37085 | 1260 | 1290 | 1450 | 1820 | 2210 | 1606 | 552.5 |
37086 | 1580 | 1620 | 1820 | 2290 | 2790 | 2020 | 697.5 |
37118 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 | 490.0 |
37127 | 1240 | 1270 | 1430 | 1800 | 2190 | 1586 | 547.5 |
37128 | 1510 | 1550 | 1740 | 2190 | 2670 | 1932 | 667.5 |
37129 | 1420 | 1460 | 1640 | 2060 | 2510 | 1818 | 627.5 |
37130 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 | 520.0 |
37132 | 1180 | 1210 | 1360 | 1710 | 2080 | 1508 | 520.0 |
37149 | 1100 | 1130 | 1270 | 1590 | 1960 | 1410 | 490.0 |
37153 | 1410 | 1450 | 1630 | 2040 | 2490 | 1804 | 622.5 |
37167 | 1290 | 1330 | 1490 | 1870 | 2280 | 1652 | 570.0 |