XiaoFeiMei_Forcats_package_Vignette

Author

XiaoFei Mei

Introduction

The task is to practice collaborating on a code project with GitHub. Each DATA607 student creates a vignette about one or more TidyVerse packages with actual usage examples. Together, we are building a book of examples on how to use TidyVerse functions.

The package I chose is forcats, specifically the following functions: fct_lump_n, fct_reorder, fct_collapse and fct_na_value_to_level.

To demonstrate these functions, I’ll be using a dataset of sold residential properties in Florida, sourced from Kaggle. The dataset contains listing prices, sale prices, square footage, and categorical variables like type and sub_type that describe each property.

Import Data

A copy of the csv file is stored in a separate public repository, so anyone can load it directly by running the code below with the URL.

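library(tidyverse)  # loads readr, dplyr, ggplot2 and forcats
library(scales)     # provides dollar_format() for the plot axes
re <- read_csv("https://raw.githubusercontent.com/xiaofeimei1/datasetPublic/refs/heads/main/florida_real_estate_sold_properties_ultimate.csv")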

Check unique property types from the dataset.

re |> count(type, sort = TRUE)
# A tibble: 8 × 2
  type                            n
  <chr>                       <int>
1 single_family                7101
2 condos                       1702
3 townhomes                     762
4 land                          641
5 mobile                        560
6 multi_family                  111
7 condo_townhome_rowhome_coop    13
8 farm                            3

fct_lump_n() function

From the overview above, we can see that several levels are rare. Those levels make plots unnecessarily cluttered. fct_lump_n() keeps the n most frequent levels and collapses everything else into ‘Other’.

re_lumped <- re |> 
  mutate(type_lumped = fct_lump_n(type, n = 5))  #using top 5 for this demo.

re_lumped |>
  count(type_lumped, sort = TRUE)
# A tibble: 6 × 2
  type_lumped       n
  <fct>         <int>
1 single_family  7101
2 condos         1702
3 townhomes       762
4 land            641
5 mobile          560
6 Other           127

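As we can see, the list is cleaner than before. Keeping only the top 5 types gives a clean, interpretable summary: the three rarest types (multi_family, condo_townhome_rowhome_coop and farm, 127 rows in total) are folded into Other, while the meaningful categories remain.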
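
To see the lumping rule in isolation, here is a minimal sketch on a made-up vector. The values in x are hypothetical and are not taken from the Florida dataset. It also shows the other_level argument, which renames the catch-all level.

```r
library(forcats)

# Hypothetical toy data: "a" appears 3 times, "b" twice, "c" and "d" once each.
x <- c("a", "a", "a", "b", "b", "c", "d")

# Keep the 2 most frequent levels; everything else becomes "Other".
lumped <- fct_lump_n(x, n = 2)
levels(lumped)
# [1] "a"     "b"     "Other"

# other_level changes the name of the catch-all level.
lumped_rare <- fct_lump_n(x, n = 2, other_level = "Rare")
table(lumped_rare)
```

Note that a character vector is accepted directly; forcats converts it to a factor, which is why the type column above did not need as.factor() first.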

fct_reorder() function

Bar charts are much easier to read when bars are sorted by a meaningful numeric variable instead of alphabetical order. fct_reorder() reorders a factor’s levels based on a summary statistic of another variable.

re_lumped |>
  group_by(type_lumped) |>
  summarise(median_price = median(lastSoldPrice, na.rm = TRUE)) |>
  mutate(type_lumped = fct_reorder(type_lumped, median_price)) |>
  ggplot(aes(x = type_lumped, y = median_price)) +
  geom_col(fill = "#E84855") +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip() +
  labs(
    title    = "Median Sale Price by Property Type",
    subtitle = "fct_reorder() sorts bars by median price",
    x        = NULL,
    y        = "Median Sale Price (USD)"
  ) +
  theme_minimal(base_size = 13)
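
The effect on level order is easier to see without a plot. The sketch below uses hypothetical prices (not from the dataset) and compares the default alphabetical order with the order fct_reorder() produces.

```r
library(forcats)

# Hypothetical toy data: two sale prices per property type.
type  <- c("condo", "condo", "land", "land", "mobile", "mobile")
price <- c(300, 320, 90, 110, 180, 220)

# A plain factor orders its levels alphabetically.
levels(factor(type))
# [1] "condo"  "land"   "mobile"

# fct_reorder() sorts levels by a summary of price (the median by default).
by_price <- fct_reorder(type, price)
levels(by_price)
# [1] "land"   "mobile" "condo"

# .desc = TRUE reverses the order, largest median first.
by_price_desc <- fct_reorder(type, price, .desc = TRUE)
levels(by_price_desc)
# [1] "condo"  "mobile" "land"
```

Because ggplot2 draws discrete axes in level order, changing the levels is all it takes to sort the bars.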

fct_collapse() function

Like fct_lump_n(), fct_collapse() consolidates categories into broader groups. The difference between the two is that fct_lump_n() regroups by frequency, driven by the data, whereas fct_collapse() regroups by user-defined rules: you name each new group and list the existing levels that belong to it. The listed names must match the levels in the data exactly; names that do not exist trigger an "unknown levels" warning and change nothing. In this dataset sub_type has only two non-missing values, ‘condo’ and ‘townhouse’, so the example merges both into a single ‘Attached Unit’ group.

re_collapsed <- re |>
  filter(!is.na(sub_type)) |>
  mutate(
    sub_type     = as.factor(sub_type),
    # new group name = existing levels to merge into it
    sub_type_grp = fct_collapse(sub_type,
      "Attached Unit" = c("condo", "townhouse")
    )
  )

re_collapsed |>
  count(sub_type_grp, sort = TRUE)
# A tibble: 1 × 2
  sub_type_grp      n
  <fct>         <int>
1 Attached Unit  2476
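
fct_collapse() only touches the levels you name. The following minimal sketch, on made-up labels that are not taken from the Florida dataset, shows how unmentioned levels are kept as they are, and how the other_level argument can sweep them into one catch-all group instead. The group names attached, land_other and other are illustrative choices.

```r
library(forcats)

# Hypothetical sub-type labels, one observation each.
sub_type <- factor(c("condo", "townhouse", "villa", "farm", "land", "single_family"))

# Each argument is new_group = c(existing levels to merge).
grouped <- fct_collapse(sub_type,
  attached   = c("condo", "townhouse", "villa"),
  land_other = c("farm", "land")
)

# "single_family" was not mentioned, so it stays as its own level.
table(grouped)

# other_level puts every unmentioned level into one catch-all group.
grouped_all <- fct_collapse(sub_type,
  attached    = c("condo", "townhouse", "villa"),
  other_level = "other"
)
table(grouped_all)
```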

fct_na_value_to_level() function

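NA is not a factor level, so functions that work on levels, such as levels() and table(), skip missing values by default and can silently hide a large and meaningful segment. fct_na_value_to_level() promotes NA to a proper level so it appears in outputs and can be analyzed. In this dataset that segment is large: 8,417 of the 10,893 rows have no sub_type.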

re_na <- re |>
  mutate(
    sub_type_f  = as.factor(sub_type),
    sub_type_f  = fct_lump_n(sub_type_f, n = 4),   # keep top 4 + Other
    sub_type_f  = fct_na_value_to_level(sub_type_f, level = "Not Specified")
  )

re_na |> count(sub_type_f, sort = TRUE)
# A tibble: 3 × 2
  sub_type_f        n
  <fct>         <int>
1 Not Specified  8417
2 condo          1715
3 townhouse       761

Plot the median sale price with the former NA group highlighted in grey.

re_na |>
  group_by(sub_type_f) |>
  summarise(median_price = median(lastSoldPrice, na.rm = TRUE)) |>
  mutate(sub_type_f = fct_reorder(sub_type_f, median_price)) |>
  ggplot(aes(x = sub_type_f, y = median_price,
             fill = sub_type_f == "Not Specified")) +
  geom_col() +
  scale_y_continuous(labels = dollar_format()) +
  scale_fill_manual(values = c("#2E86AB", "#AAAAAA"),
                    labels  = c("Known sub-type", "Was NA")) +
  coord_flip() +
  labs(
    title    = "Median Sale Price — Including 'Not Specified' Sub-Type",
    x        = NULL,
    y        = "Median Sale Price (USD)",
    fill     = NULL
  ) +
  theme_minimal(base_size = 13)
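
The same behaviour can be checked on a small vector. This is a minimal sketch with hypothetical values; it assumes forcats 1.0.0 or later, where fct_na_value_to_level() replaced the older fct_explicit_na().

```r
library(forcats)

# Hypothetical toy data with two missing values.
sub_type <- factor(c("condo", NA, "townhouse", NA, "condo"))

# table() ignores NA by default, so the missing rows vanish from the counts.
table(sub_type)

# After conversion, the former NAs are counted under an explicit level.
explicit <- fct_na_value_to_level(sub_type, level = "Not Specified")
table(explicit)

# No missing values remain in the factor.
anyNA(explicit)
# [1] FALSE
```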

Summary

The forcats package transforms messy, poorly ordered categorical columns into a simplified, organized format. The four functions described in this vignette, fct_lump_n, fct_reorder, fct_collapse and fct_na_value_to_level, are practical tools for reshaping categorical data and producing accurate, readable analyses.