XiaoFeiMei_Forcats_package_Vignette
Introduction
The task is to practice collaborating on a code project with GitHub. Each DATA607 student creates a vignette about one or more TidyVerse packages with actual usage examples. Together we are building a book of examples showing how to use TidyVerse functions.
The package I chose is forcats, specifically the following functions: fct_lump_n, fct_reorder, fct_collapse and fct_na_value_to_level.
To demonstrate these functions, I’ll use a dataset of sold residential properties in Florida sourced from Kaggle. The dataset contains listing prices, sale prices, square footage, and categorical variables like type and sub_type that describe each property.
Import Data
A copy of the CSV file is stored in a separate public repository, so anyone can load it directly by running the code below.
library(tidyverse) # loads readr, dplyr, ggplot2 and forcats
re <- read_csv("https://raw.githubusercontent.com/xiaofeimei1/datasetPublic/refs/heads/main/florida_real_estate_sold_properties_ultimate.csv")
Check unique property types from the dataset.
re |> count(type, sort = TRUE)
# A tibble: 8 × 2
  type                            n
  <chr>                       <int>
1 single_family                7101
2 condos                       1702
3 townhomes                     762
4 land                          641
5 mobile                        560
6 multi_family                  111
7 condo_townhome_rowhome_coop    13
8 farm                            3
fct_lump_n() function
From the dataset overview above, we can see that several levels are rare. Plotting every level makes charts unnecessarily cluttered. fct_lump_n() keeps the n most frequent levels and collapses everything else into ‘Other’.
re_lumped <- re |>
  mutate(type_lumped = fct_lump_n(type, n = 5)) # using top 5 for this demo
re_lumped |>
  count(type_lumped, sort = TRUE)
# A tibble: 6 × 2
  type_lumped       n
  <fct>         <int>
1 single_family  7101
2 condos         1702
3 townhomes       762
4 land            641
5 mobile          560
6 Other           127
As we can see, the list is cleaner than before. Keeping only the top 5 types gives a clean, interpretable result without losing the major categories; the three rarest types (127 properties in total) are grouped under ‘Other’.
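As a minimal, self-contained sketch of the same idea, the toy factor below (made-up values, not taken from the Florida dataset) shows how fct_lump_n() keeps the n most frequent levels. The label "Rare types" is only an illustrative choice for the other_level argument.

```r
library(forcats)

# Toy data (hypothetical): "a" appears 3 times, "b" twice, "c" and "d" once each
x <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; everything else is lumped into "Other"
lumped <- fct_lump_n(x, n = 2)
table(lumped)

# Same lumping, with a custom label for the combined level
lumped_custom <- fct_lump_n(x, n = 2, other_level = "Rare types")
levels(lumped_custom)
```

Choosing n is a judgment call: a smaller n gives a tidier summary, but a larger share of the data ends up in the combined level.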
fct_reorder() function
Bar charts are much easier to read when bars are sorted by a meaningful numeric variable instead of alphabetical order. fct_reorder() reorders a factor’s levels based on a summary statistic of another variable.
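Before the real-estate example, here is a small sketch of the reordering on a toy factor. The group names and prices are made up for illustration. It relies on fct_reorder() using the median as its default summary function, and on the .desc argument for descending order.

```r
library(forcats)

# Toy data (hypothetical): three groups with two prices each
grp   <- factor(c("a", "b", "c", "a", "b", "c"))
price <- c(30, 10, 20, 40, 12, 18)

# Group medians: a = 35, b = 11, c = 19, so ascending order is b, c, a
asc <- fct_reorder(grp, price)
levels(asc)   # "b" "c" "a"

# .desc = TRUE reverses the order
desc <- fct_reorder(grp, price, .desc = TRUE)
levels(desc)  # "a" "c" "b"
```

Only the level order changes; the values stored in the factor stay the same, which is why the reordered factor can be passed straight to ggplot().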
re_lumped |>
  group_by(type_lumped) |>
  summarise(median_price = median(lastSoldPrice, na.rm = TRUE)) |>
  mutate(type_lumped = fct_reorder(type_lumped, median_price)) |>
  ggplot(aes(x = type_lumped, y = median_price)) +
  geom_col(fill = "#E84855") +
  scale_y_continuous(labels = scales::dollar_format()) + # scales:: prefix, since scales is not attached by tidyverse
  coord_flip() +
  labs(
    title = "Median Sale Price by Property Type",
    subtitle = "fct_reorder() sorts bars by median price",
    x = NULL,
    y = "Median Sale Price (USD)"
  ) +
  theme_minimal(base_size = 13)
fct_collapse() function
Like fct_lump_n(), fct_collapse() consolidates categories into broader groups. The difference is that fct_lump_n() regroups based on the data (level frequencies), whereas fct_collapse() regroups according to rules the user defines by hand. The example below applies fct_collapse() to sub_type, whose only non-missing values in this dataset are ‘condo’ and ‘townhouse’.
re_collapsed <- re |>
  filter(!is.na(sub_type)) |>
  mutate(
    sub_type = as.factor(sub_type),
    # Level names must match the data exactly. Names that are not existing
    # levels are skipped with an "unknown levels" warning.
    sub_type_grp = fct_collapse(sub_type,
      "Attached Unit" = c("Condo", "Villa", "Townhouse", "Half Duplex"),
      "Single Family" = c("Single Family Residence", "Manufactured Home"),
      "Land / Other" = c("Unimproved Land", "Farm", "Agricultural")
    )
  )
# sub_type only contains "condo" and "townhouse" here, so both levels are left unchanged
re_collapsed |>
  count(sub_type_grp, sort = TRUE)
# A tibble: 2 × 2
  sub_type_grp     n
  <fct>        <int>
1 condo         1715
2 townhouse      761
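To see the grouping take effect, here is a minimal sketch on a toy factor. The rows are made up, but the level names are borrowed from the type column shown earlier, and the group labels "Attached unit" and "Land / farm" are illustrative choices. The last step assumes that forcats warns when a requested name is not an existing level (for example "Condo" instead of "condos").

```r
library(forcats)

# Toy factor (hypothetical rows) using level names from the type column
type <- factor(c("single_family", "condos", "townhomes", "land", "farm", "condos"))

# Each new group is defined by hand as a vector of existing levels
grouped <- fct_collapse(type,
  "Attached unit" = c("condos", "townhomes"),
  "Land / farm"   = c("land", "farm")
)
table(grouped)

# A name that is not an existing level is expected to trigger a warning
warned <- tryCatch(
  {
    fct_collapse(type, "Attached unit" = "Condo")
    FALSE
  },
  warning = function(w) TRUE
)
warned
```

Levels that are not mentioned in any group, such as single_family here, are kept as they are.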
fct_na_value_to_level() function
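Missing values are not a regular factor level, so some summaries (for example base R's table()) leave them out by default, and they cannot be relabeled or reordered like other levels. This can silently hide a potentially large and meaningful segment of the data. fct_na_value_to_level() promotes NA into a proper level so it appears in outputs and can be analyzed.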
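A small sketch of this behaviour on a toy factor follows. The values are made up, and it assumes forcats 1.0.0 or later, where fct_na_value_to_level() is available.

```r
library(forcats)

# Toy factor (hypothetical) with two missing values
sub_type <- factor(c("condo", NA, "townhouse", NA, "condo"))

# table() omits NA by default, so the two missing rows are not counted
table(sub_type)

# Promote NA to an explicit level with a readable label
filled <- fct_na_value_to_level(sub_type, level = "Not Specified")
table(filled)
```

Once the missing values have their own level, they can be counted, reordered and plotted like any other category.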
re_na <- re |>
  mutate(
    sub_type_f = as.factor(sub_type),
    sub_type_f = fct_lump_n(sub_type_f, n = 4), # keep top 4 + Other
    sub_type_f = fct_na_value_to_level(sub_type_f, level = "Not Specified")
  )
re_na |> count(sub_type_f, sort = TRUE)
# A tibble: 3 × 2
  sub_type_f        n
  <fct>         <int>
1 Not Specified  8417
2 condo          1715
3 townhouse       761
re_na |>
  group_by(sub_type_f) |>
  summarise(median_price = median(lastSoldPrice, na.rm = TRUE)) |>
  mutate(sub_type_f = fct_reorder(sub_type_f, median_price)) |>
  ggplot(aes(x = sub_type_f, y = median_price,
             fill = sub_type_f == "Not Specified")) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + # scales:: prefix, since scales is not attached by tidyverse
  scale_fill_manual(values = c("#2E86AB", "#AAAAAA"),
                    labels = c("Known sub-type", "Was NA")) +
  coord_flip() +
  labs(
    title = "Median Sale Price, Including 'Not Specified' Sub-Type",
    x = NULL,
    y = "Median Sale Price (USD)",
    fill = NULL
  ) +
  theme_minimal(base_size = 13)
Summary
The forcats package transforms messy, poorly ordered categorical columns into a simplified, organized format. The four functions described in this vignette (fct_lump_n, fct_reorder, fct_collapse and fct_na_value_to_level) are practical tools for transforming data and producing accurate, readable analyses.