Mastering Categorical Data in R: A Guide to forcats
Handling Factors and Levels Systematically
Author
Abdullah Al Shamim
Published
February 13, 2026
What will we learn?
Inspection: How to check factor levels and frequencies.
Cleaning: Removing unused or empty levels.
Reordering: Changing the order of levels for better visualization.
Recoding: Renaming or collapsing multiple levels into meaningful groups.
1. Setup and Data Inspection
Factors are R's data structure for categorical variables that take a fixed, known set of values. We will use the gss_cat dataset, a sample from the General Social Survey that ships with forcats.
Code
library(tidyverse)
theme_set(theme_minimal())
# View the General Social Survey dataset
# view(gss_cat)
# Check the levels of a factor variable
levels(gss_cat$race)
[1] "Other" "Black" "White" "Not applicable"
Code
# View frequency table
gss_cat %>%
  select(race) %>%
  table()
race
         Other          Black          White Not applicable
          1959           3129          16395              0
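The same inspection can be done with forcats itself. The sketch below uses a small made-up factor (`pets` is hypothetical, not part of gss_cat) to show that `levels()` lists every declared level and that `fct_count()` reports a count for each one, including levels with zero observations.

```r
library(forcats)

# Hypothetical toy factor: "bird" is declared as a level but never observed
pets <- factor(c("dog", "cat", "dog", "dog"),
               levels = c("cat", "dog", "bird"))

levels(pets)   # every declared level, in level order
nlevels(pets)  # number of declared levels

# fct_count() returns a tibble with columns f (level) and n (count);
# empty levels are kept with n = 0
pet_counts <- fct_count(pets)
pet_counts
```

Unlike `table()`, `fct_count()` returns a tibble, so the result can be piped straight into further dplyr steps.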
2. Removing Unused Levels
Sometimes a factor declares levels that have zero observations in the data, such as the "Not applicable" level of race in the frequency table above. We can remove these with fct_drop.
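A minimal sketch of `fct_drop()` applied to the race factor from section 1. It assumes only that forcats is installed; the names `race_clean`, `f`, and `f_partial` are illustrative.

```r
library(forcats)

# Drop every level of race that has no observations
race_clean <- fct_drop(gss_cat$race)

levels(gss_cat$race)  # includes the empty "Not applicable" level
levels(race_clean)    # only levels that actually occur

# The `only` argument restricts which empty levels may be dropped.
# Hypothetical toy factor with two unused levels, "c" and "d":
f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))
f_partial <- fct_drop(f, only = "c")
levels(f_partial)     # "d" is kept because it was not listed in `only`
```

Inside a pipeline the same call would typically sit in a `mutate()`, for example `mutate(race = fct_drop(race))`.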
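3. Reordering Factor Levels
Order by Frequency (fct_infreq)
fct_infreq sorts levels by the number of observations, most frequent first; fct_rev then reverses that order.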
Code
## Order the levels by frequency, then reverse the order
gss_cat %>%
  mutate(marital = fct_infreq(marital)) %>%
  mutate(marital = fct_rev(marital)) %>%
  ggplot(aes(marital)) +
  geom_bar(fill = "steelblue") +
  theme_classic(base_size = 20)
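To see what the two reordering calls do without a plot, here is a small sketch on a made-up factor (`status` and its values are hypothetical). It prints the level order after each step.

```r
library(forcats)

# Hypothetical toy factor: married x3, single x2, divorced x1
status <- factor(c("single", "married", "married",
                   "divorced", "married", "single"))

levels(status)                  # default order is alphabetical

by_freq <- fct_infreq(status)   # most frequent level first
levels(by_freq)

rev_freq <- fct_rev(by_freq)    # same levels, order reversed
levels(rev_freq)
```

With a vertical bar chart, the reversed order places the least frequent category on the left, so the bars rise from left to right.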
Order by Value in Another Variable (fct_reorder)
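fct_reorder sorts the levels of a factor (here, religion) by the values of a numeric variable (here, mean TV hours), so the plotted points appear in ascending order rather than in the original level order.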
Code
## Order the levels of one variable by the values of another variable
gss_cat %>%
  group_by(relig) %>%
  summarise(TV_watchtime_mean = mean(tvhours, na.rm = TRUE)) %>%
  mutate(relig = fct_reorder(relig, TV_watchtime_mean)) %>%
  ggplot(aes(TV_watchtime_mean, relig)) +
  geom_point(size = 5, color = "steelblue") +
  theme_light(base_size = 20)
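The pipeline above summarises first, so each level has exactly one value to sort by. `fct_reorder()` can also be applied to unsummarised data, in which case it reduces the values for each level with `.fun` (the median by default). The sketch below uses made-up data (`group` and `hours` are hypothetical) to show how the choice of `.fun` and `.desc` changes the level order.

```r
library(forcats)

# Hypothetical toy data: level "c" has one large outlier
group <- factor(c("a", "a", "b", "b", "c", "c", "c"))
hours <- c(5, 7, 1, 3, 3, 3, 30)

# Default summary is the median: a = 6, b = 2, c = 3
by_median <- fct_reorder(group, hours)
levels(by_median)

# Using the mean instead: a = 6, b = 2, c = 12
by_mean <- fct_reorder(group, hours, .fun = mean)
levels(by_mean)

# .desc = TRUE sorts from largest to smallest summary value
by_mean_desc <- fct_reorder(group, hours, .fun = mean, .desc = TRUE)
levels(by_mean_desc)
```

Because the outlier moves level "c" from the middle to the end when the mean is used, it is worth choosing the summary function deliberately rather than relying on the default.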
4. Modifying Factor Levels (Recoding)
If level names are unclear or redundant, we can rename them or merge several levels into one.
Code
## Count the initial party ID levels
gss_cat %>%
  count(partyid)
# A tibble: 10 × 2
partyid n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Strong republican 2314
5 Not str republican 3032
6 Ind,near rep 1791
7 Independent 4119
8 Ind,near dem 2499
9 Not str democrat 3690
10 Strong democrat 3490
Code
unique(gss_cat$partyid)
[1] Ind,near rep Not str republican Independent Not str democrat
[5] Strong democrat Ind,near dem Strong republican Other party
[9] No answer Don't know
10 Levels: No answer Don't know Other party ... Strong democrat
Renaming Levels (fct_recode)
Code
## Combine similar levels into "others"
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                              "others" = "Other party",
                              "others" = "No answer",
                              "others" = "Don't know")) %>%
  count(partyid)
# A tibble: 8 × 2
partyid n
<fct> <int>
1 others 548
2 Strong republican 2314
3 Not str republican 3032
4 Ind,near rep 1791
5 Independent 4119
6 Ind,near dem 2499
7 Not str democrat 3690
8 Strong democrat 3490
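When many levels map to the same new name, repeating the name in `fct_recode()` becomes verbose. One alternative is `fct_collapse()`, which takes a character vector of old levels for each new level. The sketch below is one possible grouping of partyid, not the only sensible one; the group names `republican`, `independent`, and `democrat` are assumptions made for illustration.

```r
library(forcats)

# Collapse the ten partyid levels into four broader groups.
# Group names other than "others" are illustrative choices.
party <- fct_collapse(
  gss_cat$partyid,
  others      = c("No answer", "Don't know", "Other party"),
  republican  = c("Strong republican", "Not str republican"),
  independent = c("Ind,near rep", "Independent", "Ind,near dem"),
  democrat    = c("Not str democrat", "Strong democrat")
)

# One row per collapsed level, with its count
fct_count(party)
```

The "others" group here contains the same three levels as the `fct_recode()` call above, so its count should match the 548 shown in that output.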