Mastering Categorical Data in R: A Guide to forcats

Handling Factors and Levels Systematically

Author

Abdullah Al Shamim

Published

February 13, 2026

What will we learn?

  • Inspection: How to check factor levels and frequencies.
  • Cleaning: Removing unused or empty levels.
  • Reordering: Changing the order of levels for better visualization.
  • Recoding: Renaming or collapsing multiple levels into meaningful groups.

1. Setup and Data Inspection

Factors represent categorical variables whose possible values (levels) form a fixed, known set. We will use the gss_cat dataset, a sample of the General Social Survey that ships with forcats.
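
As a minimal sketch of what a fixed and known set of values means, the base R example below builds a factor from a hypothetical sizes vector (toy data, not part of gss_cat). Values outside the declared levels become NA, and the level order is stored independently of the data.

```r
# Hypothetical toy data, for illustration only
size_levels <- c("small", "medium", "large")
sizes <- factor(c("small", "large", "medium", "small"), levels = size_levels)

levels(sizes)      # the declared set of levels, in declared order
as.integer(sizes)  # each value is stored as an index into the levels

# A value outside the declared set cannot be represented and becomes NA
bad <- factor(c("small", "huge"), levels = size_levels)
is.na(bad)
```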

Code
library(tidyverse)
theme_set(theme_minimal())

# View the General Social Survey dataset
# view(gss_cat)

# Check the levels of a factor variable
levels(gss_cat$race)
[1] "Other"          "Black"          "White"          "Not applicable"
Code
# View frequency table
gss_cat %>% 
  select(race) %>% 
  table()
race
         Other          Black          White Not applicable 
          1959           3129          16395              0 
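
forcats also provides fct_count() for the same inspection step. The sketch below assumes forcats is installed (it is loaded by library(tidyverse)) and uses the sort and prop arguments to rank the levels and add proportions.

```r
library(forcats)

# Count every level of race, including levels with zero observations
race_counts <- fct_count(gss_cat$race, sort = TRUE, prop = TRUE)
race_counts

# Number of declared levels, whether or not they occur in the data
nlevels(gss_cat$race)
```

The result is a tibble with columns f (level), n (count), and p (proportion), so it can be passed on to further dplyr steps.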

2. Removing Unused Levels

Sometimes a factor declares levels that have zero observations in the data, like "Not applicable" in the table above. We can remove them with fct_drop().

Code
## remove unused levels  
gss_cat %>% 
  mutate(race = fct_drop(race)) %>% 
  select(race) %>% 
  table()
race
Other Black White 
 1959  3129 16395 
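
fct_drop() also accepts an only argument that restricts which empty levels may be removed. A minimal sketch on a hypothetical toy factor (not from gss_cat):

```r
library(forcats)

# Toy factor: levels "c" and "d" are declared but never observed
f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

levels(fct_drop(f))              # drops every empty level
levels(fct_drop(f, only = "c"))  # drops "c" only; the empty level "d" is kept
```

This can help when an empty level is still meaningful (for example, a survey answer nobody chose) and should keep appearing in tables or plots.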

3. Modifying Factor Order

The order of a factor's levels controls the order of axes, legends, and table rows, so reordering levels is essential for creating readable and professional plots.

Manual Reordering (fct_relevel)

Move specific levels to a custom position (e.g., bringing “White” and “Black” to the front). Levels you do not list keep their existing relative order behind the ones you do.

Code
gss_cat %>% 
  mutate(race = fct_drop(race)) %>% 
  mutate(race = fct_relevel(race, c("White", "Black", "Other"))) %>% 
  select(race) %>% 
  table()
race
White Black Other 
16395  3129  1959 
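
fct_relevel() does not require the full list of levels. A short sketch, assuming forcats is loaded, that moves a single level and uses the after argument to choose where it lands:

```r
library(forcats)

race <- gss_cat$race
levels(race)  # original order of the four declared levels

# Default: the named level moves to the front
front <- fct_relevel(race, "White")

# after = Inf sends the named level to the end instead
last <- fct_relevel(race, "Other", after = Inf)

levels(front)
levels(last)
```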

Order by Frequency (fct_infreq)

Sort levels by how often they occur, most frequent first. Adding fct_rev() reverses that order, so the bars in the plot run from least to most frequent.

Code
## order the factors by frequency and reverse
gss_cat %>% 
  mutate(marital = fct_infreq(marital)) %>% 
  mutate(marital = fct_rev(marital)) %>% 
  ggplot(aes(marital)) +
  geom_bar(fill = "steelblue") +
  theme_classic(base_size = 20)
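
To check the resulting order without drawing a plot, the levels can be inspected directly. This sketch reuses the race variable, whose counts appear in the frequency table in section 1:

```r
library(forcats)

by_freq <- fct_infreq(gss_cat$race)

levels(by_freq)           # most frequent level first
levels(fct_rev(by_freq))  # same levels, in reverse order
```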

Order by Value in Another Variable (fct_reorder)

fct_reorder() sorts the levels of one variable (like religion) by a numeric value from another variable (like average TV hours).

Code
## order levels of one variable by value in another variable
gss_cat %>% 
  group_by(relig) %>% 
  summarise(TV_watchtime_mean = mean(tvhours, na.rm = TRUE)) %>% 
  mutate(relig = fct_reorder(relig, TV_watchtime_mean)) %>% 
  ggplot(aes(TV_watchtime_mean, relig)) +
  geom_point(size = 5, color = "steelblue") +
  theme_light(base_size = 20)
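
In the example above the data is summarised first, so each religion has a single value. fct_reorder() can also summarise raw values itself: it applies .fun (the median by default) within each level. A minimal sketch on hypothetical toy vectors:

```r
library(forcats)

# Hypothetical toy data: two observations per group
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(1, 9, 4, 5, 2, 3)

# Default: order levels by the median of value within each group
by_median <- fct_reorder(group, value)

# Order by the group minimum instead
by_min <- fct_reorder(group, value, .fun = min)

# Descending order of the group medians
by_median_desc <- fct_reorder(group, value, .desc = TRUE)

levels(by_median)
levels(by_min)
levels(by_median_desc)
```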


4. Modifying Factor Levels (Recoding)

If level names are unclear or redundant, we can rename them. First, inspect the existing levels.

Code
## count initial party ID levels
gss_cat %>% 
  count(partyid)
# A tibble: 10 × 2
   partyid                n
   <fct>              <int>
 1 No answer            154
 2 Don't know             1
 3 Other party          393
 4 Strong republican   2314
 5 Not str republican  3032
 6 Ind,near rep        1791
 7 Independent         4119
 8 Ind,near dem        2499
 9 Not str democrat    3690
10 Strong democrat     3490
Code
unique(gss_cat$partyid)  
 [1] Ind,near rep       Not str republican Independent        Not str democrat  
 [5] Strong democrat    Ind,near dem       Strong republican  Other party       
 [9] No answer          Don't know        
10 Levels: No answer Don't know Other party ... Strong democrat

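Renaming Levels (fct_recode)

fct_recode() takes "new name" = "old name" pairs. Assigning the same new name to several old levels merges them, and levels that are not mentioned stay unchanged.
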
Code
## combine similar levels into "others"
gss_cat %>% 
  mutate(partyid = fct_recode(partyid, 
                              "others" = "Other party",
                              "others" = "No answer",
                              "others" = "Don't know")) %>% 
  count(partyid)
# A tibble: 8 × 2
  partyid                n
  <fct>              <int>
1 others               548
2 Strong republican   2314
3 Not str republican  3032
4 Ind,near rep        1791
5 Independent         4119
6 Ind,near dem        2499
7 Not str democrat    3690
8 Strong democrat     3490
Code
## full recode/rename of levels
gss_cat %>% 
  mutate(partyid = fct_recode(partyid, 
                              "others" = "Other party",
                              "others" = "No answer",
                              "others" = "Don't know",
                              "Not Strong republican" = "Not str republican",
                              "Slightly republican" = "Ind,near rep",
                              "Slightly democrat" = "Ind,near dem",
                              "Not Strong democrat" = "Not str democrat")) %>% 
  count(partyid)
# A tibble: 8 × 2
  partyid                   n
  <fct>                 <int>
1 others                  548
2 Strong republican      2314
3 Not Strong republican  3032
4 Slightly republican    1791
5 Independent            4119
6 Slightly democrat      2499
7 Not Strong democrat    3690
8 Strong democrat        3490
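
When the same mapping is reused or grows long, it can be stored in a named character vector and spliced into fct_recode() with !!!. A sketch of that pattern, where lookup is a hypothetical name chosen for this example:

```r
library(forcats)

# Names are the new levels, values are the existing levels
lookup <- c(
  "others" = "Other party",
  "others" = "No answer",
  "others" = "Don't know"
)

recoded <- fct_recode(gss_cat$partyid, !!!lookup)

nlevels(recoded)          # three levels were merged into one
sum(recoded == "others")  # observations in the merged level
```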

Grouping Levels (fct_collapse)

This is a more concise way to merge several levels into each new category in a single call: every new name is paired with a vector of the old levels it should absorb.

Code
## collapse similar levels into broader groups with fct_collapse()
## note: the new "Independent" group also absorbs the existing "Independent" level
gss_cat %>% 
  mutate(partyid = fct_collapse(partyid, 
                                "others" = c("Other party",
                                             "No answer",
                                             "Don't know"),
                                "Republican" = c("Not str republican",
                                                 "Strong republican"),
                                "Democrat" = c("Strong democrat",
                                               "Not str democrat"),
                                "Independent" = c("Ind,near dem", 
                                                  "Ind,near rep"))) %>% 
  count(partyid)
# A tibble: 4 × 2
  partyid         n
  <fct>       <int>
1 others        548
2 Republican   5346
3 Independent  8409
4 Democrat     7180
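
If only a few groups matter, fct_collapse() can sweep every remaining level into one catch-all category through its other_level argument. A minimal sketch, where the label "Other/Independent" is a hypothetical choice for this example:

```r
library(forcats)

party <- fct_collapse(
  gss_cat$partyid,
  "Republican" = c("Strong republican", "Not str republican"),
  "Democrat"   = c("Strong democrat", "Not str democrat"),
  other_level  = "Other/Independent"  # hypothetical label for everything else
)

fct_count(party)
```

The catch-all level is placed last, which keeps it out of the way in plots and tables.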

Systematic Checklist (Cheat Sheet):

  • Drop Empty Levels: fct_drop()
  • Manual Order: fct_relevel()
  • Frequency Order: fct_infreq()
  • Reverse Order: fct_rev()
  • Order by Another Variable: fct_reorder()
  • Rename Levels: fct_recode()
  • Group/Collapse: fct_collapse()

Excellent work! You have mastered the most important tools for handling categorical data in R.