library(readr)
library(dplyr)
Suppose our dataset looks something like this:
fruits <- read_csv(
"name, quantity
gala apples, 3
red delicious, 2
bananas, 4
oranges, 6
grape-fruits, 2
blueberries, 20
straw-berries, 15
strawberries, 12
")
fruits
| name | quantity |
|---|---|
| gala apples | 3 |
| red delicious | 2 |
| bananas | 4 |
| oranges | 6 |
| grape-fruits | 2 |
| blueberries | 20 |
| straw-berries | 15 |
| strawberries | 12 |
Let’s assume that we want to correct the spelling or standardize the values of an existing column. In our case, strawberries are sometimes spelled with a hyphen and sometimes without. The dataset also sometimes lists a specific variety of apple (gala apples, red delicious) instead of just apples, so we need to fix that as well.
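Before recoding anything, it can help to list the distinct values in the column so that inconsistent spellings stand out. The following is a minimal sketch in base R, assuming the names are available as a plain character vector; the vector `fruit_names` is a hypothetical inline copy of the `name` column, not something defined elsewhere in this walkthrough.

```r
# Hypothetical inline copy of the name column from the example dataset
fruit_names <- c("gala apples", "red delicious", "bananas", "oranges",
                 "grape-fruits", "blueberries", "straw-berries", "strawberries")

# Sorted distinct values make near-duplicates easier to spot
sort(unique(fruit_names))

# Values containing a hyphen are candidates for standardizing
hyphenated <- fruit_names[grepl("-", fruit_names, fixed = TRUE)]
hyphenated
```

On a real dataset the same two calls can be run directly on the column, for example `sort(unique(fruits$name))`.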
First, using base R, let’s make a copy of the name variable to ensure that we don’t overwrite the original variable by mistake.
fruits$new_name <- fruits$name
fruits
| name | quantity | new_name |
|---|---|---|
| gala apples | 3 | gala apples |
| red delicious | 2 | red delicious |
| bananas | 4 | bananas |
| oranges | 6 | oranges |
| grape-fruits | 2 | grape-fruits |
| blueberries | 20 | blueberries |
| straw-berries | 15 | straw-berries |
| strawberries | 12 | strawberries |
Now let’s fix apples:
fruits$new_name[fruits$name == "gala apples" | fruits$name == "red delicious"] <- "apples"
fruits
| name | quantity | new_name |
|---|---|---|
| gala apples | 3 | apples |
| red delicious | 2 | apples |
| bananas | 4 | bananas |
| oranges | 6 | oranges |
| grape-fruits | 2 | grape-fruits |
| blueberries | 20 | blueberries |
| straw-berries | 15 | straw-berries |
| strawberries | 12 | strawberries |
You can repeat this as many times as you want. If you have more than one value to compare against, it’s best to use %in%:
fruits$new_name[fruits$name %in% c("grapefruits", "grape-fruits")] <- "grapefruits"
fruits$new_name[fruits$name %in% c("straw-berries", "strawberries")] <- "strawberries"
fruits
| name | quantity | new_name |
|---|---|---|
| gala apples | 3 | apples |
| red delicious | 2 | apples |
| bananas | 4 | bananas |
| oranges | 6 | oranges |
| grape-fruits | 2 | grapefruits |
| blueberries | 20 | blueberries |
| straw-berries | 15 | strawberries |
| strawberries | 12 | strawberries |
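When the list of replacements grows, repeating one assignment per value can get tedious. One alternative, sketched here as an assumption rather than as part of the walkthrough above, is to keep the old-to-new pairs in a named character vector and look every name up in it. The names `fruits_lookup` and `recode_map` are hypothetical and introduced only for this illustration.

```r
# Same data as above, built with base R so the sketch is self-contained
fruits_lookup <- data.frame(
  name = c("gala apples", "red delicious", "bananas", "oranges",
           "grape-fruits", "blueberries", "straw-berries", "strawberries"),
  quantity = c(3, 2, 4, 6, 2, 20, 15, 12),
  stringsAsFactors = FALSE  # keep name as character, also on older R versions
)

# Hypothetical lookup table: names are the old values, elements the new ones
recode_map <- c(
  "gala apples"   = "apples",
  "red delicious" = "apples",
  "grape-fruits"  = "grapefruits",
  "straw-berries" = "strawberries"
)

# Look up every name; values that are not in the map come back as NA
matched <- unname(recode_map[fruits_lookup$name])

# Keep the original name wherever there was no match
fruits_lookup$new_name <- ifelse(is.na(matched), fruits_lookup$name, matched)
fruits_lookup
```

The advantage of this design is that adding a new correction only means adding one entry to `recode_map`; the lookup code itself stays unchanged.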
dplyr Solution

Here’s how you’d do the same with dplyr using mutate() and ifelse(). First, let’s recreate the dataset:
fruits <- read_csv(
"name, quantity
gala apples, 3
red delicious, 2
bananas, 4
oranges, 6
grape-fruits, 2
blueberries, 20
straw-berries, 15
strawberries, 12
")
fruits
| name | quantity |
|---|---|
| gala apples | 3 |
| red delicious | 2 |
| bananas | 4 |
| oranges | 6 |
| grape-fruits | 2 |
| blueberries | 20 |
| straw-berries | 15 |
| strawberries | 12 |
Then we copy name to new_name, and call mutate() for as many values as we need to recode.
fruits <- fruits %>%
  mutate(new_name = name) %>%
  mutate(new_name = ifelse(name %in% c("gala apples", "red delicious"), "apples", new_name)) %>%
  mutate(new_name = ifelse(name %in% c("grapefruits", "grape-fruits"), "grapefruits", new_name)) %>%
  mutate(new_name = ifelse(name %in% c("straw-berries", "strawberries"), "strawberries", new_name))
fruits
| name | quantity | new_name |
|---|---|---|
| gala apples | 3 | apples |
| red delicious | 2 | apples |
| bananas | 4 | bananas |
| oranges | 6 | oranges |
| grape-fruits | 2 | grapefruits |
| blueberries | 20 | blueberries |
| straw-berries | 15 | strawberries |
| strawberries | 12 | strawberries |
As you can see, both solutions give us exactly the same results, so you can use whichever one you prefer.
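If the chain of mutate() calls becomes long, dplyr’s case_when() can express the same recoding in a single call. The sketch below is a possible variation, not the approach used above; `fruits_cw` is a hypothetical name, and the data is rebuilt inline with tibble() (re-exported by dplyr) so the example stands on its own.

```r
library(dplyr)

# Same data as above, built inline so the sketch is self-contained
fruits_cw <- tibble(
  name = c("gala apples", "red delicious", "bananas", "oranges",
           "grape-fruits", "blueberries", "straw-berries", "strawberries"),
  quantity = c(3, 2, 4, 6, 2, 20, 15, 12)
)

fruits_cw <- fruits_cw %>%
  mutate(
    new_name = case_when(
      name %in% c("gala apples", "red delicious")  ~ "apples",
      name %in% c("grapefruits", "grape-fruits")   ~ "grapefruits",
      name %in% c("straw-berries", "strawberries") ~ "strawberries",
      TRUE ~ name  # fallback: keep the original value
    )
  )
fruits_cw
```

Conditions in case_when() are checked in order and the first match wins, so the `TRUE ~ name` fallback has to come last.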