An important function of the TWIG database is to allow users to assess the costs of fuel treatments across broad landscapes. However, whether you take FACTS Hazardous Fuels or FACTS Common Attributes as your source, a quick glance at the raw data reveals some odd values.
This is troublesome, especially for the Area Summary reports: a single record with a large error will throw off the totals for any Area Summary report that includes it. And because these errors are so frequent, most state-level reports are likely to be very inaccurate.
For example, check out this polygon with a cost of $821k per acre:
most_expensive <- hf %>% arrange(desc(COST_PER_UOM)) %>% head(1)

most_expensive %>%
  select(USFS_REGION = PROC_REGION_CODE, NAME, COST_PER_UOM, UOM = UOM.x,
         GIS_ACRES = GIS_ACRES.y) %>%
  knitr::kable()
USFS_REGION | NAME | COST_PER_UOM | UOM | GIS_ACRES |
---|---|---|---|---|
5 | CAMINO CIELO CDZ | 821228 | ACRES | 869.059 |
Yes, Santa Barbara is expensive, but not that expensive. In this case, the record denotes a planned treatment with no completion date. There is a very similar record that does have a completion date, along with a much more reasonable cost of $300 per acre.
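To surface the sibling record, one option is to match on the treatment name. This is just my own shortcut for illustration; NAME is not an official key:

# look up sibling records of the suspect polygon by name
hf %>%
  filter(NAME == "CAMINO CIELO CDZ") %>%
  select(NAME, COST_PER_UOM, DATE_COMPLETED = DATE_COMPLETED.y)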
Because the erroneous record does not have a completion date, it does not feed up into the automatically generated reports on the TWIG viewer, e.g. the area summary for Los Padres NF. I believe the reports only include records with a completion date in the specified range. Because this particular record is excluded, this error might be a bit less serious from a reporting perspective, but it’s still a problem for anyone who wants to download the data and do their own analysis.
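If my reading is right, the reports’ inclusion logic amounts to something like this sketch (report_start and report_end are hypothetical bounds; I don’t know TWIG’s actual implementation):

# presumed report filter (a guess): keep only records completed in-range
in_report <- hf %>%
  mutate(date_completed = as.Date(DATE_COMPLETED.y, format = "%m/%d/%Y")) %>%
  filter(!is.na(date_completed),
         date_completed >= report_start,   # hypothetical report bounds
         date_completed <= report_end)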
However, there are errors in completed treatments, too - and these are affecting Area Summary reports. For example, check out this polygon in southern Utah:
`%ni%` <- Negate(`%in%`)  # "not in" helper, in case it isn't already defined

most_expensive_completed <- hf %>%
  filter(!is.na(DATE_COMPLETED.y),
         # exclude records with activity code 1117
         ACTIVITY_CODE.y %ni% c(1117)) %>%
  arrange(desc(COST_PER_UOM)) %>% head(1)

most_expensive_completed %>%
  select(USFS_REGION = PROC_REGION_CODE, NAME, COST_PER_UOM, UOM = UOM.x,
         GIS_ACRES = GIS_ACRES.y, DATE_COMPLETED = DATE_COMPLETED.y) %>%
  knitr::kable()
USFS_REGION | NAME | COST_PER_UOM | UOM | GIS_ACRES | DATE_COMPLETED |
---|---|---|---|---|---|
4 | GUM HILL CHAINING MAINTENANCE 22 | 405451 | ACRES | 1728.946 | 2023-02-01 |
This pinyon-juniper (PJ) chaining treatment supposedly cost $405k per acre. That’s just to drag a chain across the ground. This looks like a case where the total cost was entered instead of the cost per acre.
most_expensive_completed$COST_PER_UOM / most_expensive_completed$GIS_ACRES.y
## [1] 234.5076
$234/acre seems more reasonable. That would be a total cost of $405k. Instead, we have a total cost of:
most_expensive_completed$COST_PER_UOM * most_expensive_completed$GIS_ACRES.y
## [1] 701002885
Yes, that’s $701 million.
This error is feeding up into the report for the Dixie National Forest.
The treatment is there, and the report lists the total spend as $749 million. The report also lists the exact BIL funding as $700,915,411, which is within rounding error of $701,002,885. Just to double-check that the difference really is rounding, I downloaded the data straight from TWIG and recalculated the costs.
dixie <- fread("treatment_index_facts_nfpors_selected.csv")

# isolate the completed, BIL-funded treatments (there's only one, the one in
# question)
bil <- dixie %>%
  filter(fund_source == "BIL",
         date_source == "date_completed")

# compare original HAZARDOUS FUELS area with TWIG area
most_expensive_completed$GIS_ACRES.y
## [1] 1728.946
bil$acres
## [1] 1728.9

# compare original HAZARDOUS FUELS cost per acre with TWIG cost per acre
most_expensive_completed$COST_PER_UOM
## [1] 405451
bil$cost_per_uom
## [1] 405451

# compare total costs
most_expensive_completed$COST_PER_UOM * most_expensive_completed$GIS_ACRES.y
## [1] 701002885
bil$cost_per_uom * bil$acres
## [1] 700984234
So, there you go, case closed: TWIG just rounds the value for acres. How bad is this error? The total spend in the report is $749 million, and we know that $701 million of it is likely erroneous; the plausible total for that treatment is around $405k. Strip out the error and the corrected total comes to roughly $48 million, which means the report is off by … 1400%.
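The back-of-envelope version, using the ~$405k corrected total computed above:

reported  <- 749e6        # total spend shown in the Dixie NF report
erroneous <- 701002885    # implied total of the suspect record
corrected <- reported - erroneous + 405451  # swap in the plausible total
corrected
## [1] 48402566
reported / corrected
## [1] 15.47439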
Is it possible to know for sure which values are errors? No, but we can probably assume that the most extreme values are accidental typos. But what counts as “extreme”? To answer that, we need to look at the full distribution of cost-per-acre values.
Note: The data does not include a “cost per acre” field, but rather a “cost per unit” field (COST_PER_UOM), with the unit recorded separately. The unit is usually “acres”, but sometimes it’s “each”, and sometimes it is unspecified (NA). So first we need to make sure we’re dealing with the right units.
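A quick tally of the unit-of-measure field shows what we’re dealing with (here I’m counting the UOM.y column from the joined table):

# how often is the cost actually per acre?
hf %>% count(UOM.y, sort = TRUE)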
hf %>%
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES",
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 0 & !is.na(COST_PER_UOM)
  ) %>%
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 1000) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       x = "Cost per acre",
       y = "Count")
There is a long tail of high values. The vast majority of treatments are under $1000 per acre, but there are a few treatments that are much more expensive. The extreme range of values makes the histogram difficult to read, so let’s try breaking it into chunks. First, the range of values from $0 to $5000:
hf %>%
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES",
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 0 &
      COST_PER_UOM < 5000 &
      !is.na(COST_PER_UOM)
  ) %>%
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 500) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       subtitle = "Values less than $5000 only",
       x = "Cost per acre",
       y = "Count")
Now, for the remaining data (the extreme values):
hf %>%
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES",
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 5000 &
      !is.na(COST_PER_UOM)
  ) %>%
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 500) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       subtitle = "Values greater than $5000 only",
       x = "Cost per acre",
       y = "Count") +
  geom_vline(xintercept = 50000, color = "red") +
  geom_vline(xintercept = 75000, color = "red") +
  geom_vline(xintercept = 200000, color = "red") +
  annotate("text", x = 58000, y = 140, angle = 90,
           label = "cost of sod grass, installed", color = "red") +
  annotate("text", x = 82000, y = 100, angle = 90,
           label = "cost of carpet, installed", color = "red") +
  annotate("text", x = 208000, y = 80, angle = 90,
           label = "cost of NICE carpet, installed", color = "red")
There is still an extraordinary right skew, but if you look closely, you can see some small bars on the right end of the graph. Those are single records with extremely high values. You might even be able to see that there are, in fact, 20 treatments that are apparently more expensive than the nicest carpet you can buy at Home Depot ($4 per square foot, with a 61 cent installation charge, which works out to about $200k/acre).
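Spelling out that arithmetic (an acre is 43,560 square feet):

# $4.00/sq ft carpet + $0.61/sq ft installation, times 43,560 sq ft per acre
4.61 * 43560
## [1] 200811.6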
nrow(hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > 200000))
## [1] 20
There are 83 treatments that are more expensive than cheap carpet, and 129 treatments that are more expensive than installing sod grass.
nrow(hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > 50000))
## [1] 129
nrow(hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > 75000))
## [1] 83
You don’t even really need to know the distribution of cost data to know that a fuels treatment should probably be cheaper than a lawn. But, just to be safe, let’s examine a few treatments that are on the bubble of being more expensive than sod grass. For this, I’ll look for records that are “otherwise normal”: neither very old nor brand new, not Wildland Fire Use, and without a bunch of missing fields.
hf %>%
  # parse date
  mutate(date_completed = as.Date(DATE_COMPLETED.y, format = "%m/%d/%Y")) %>%
  # use some filters to get a few "reasonable" records
  filter(
    UOM.y == "ACRES",
    COST_PER_UOM < 50000,
    date_completed > as.Date("2020-01-01") &
      date_completed < as.Date("2024-06-01")
  ) %>%
  select(NAME, ACTIVITY = ACTIVITY.y, COST_PER_UOM, GIS_ACRES = GIS_ACRES.y,
         date_completed, PROC_REGION_CODE, SUID.y) %>%
  arrange(desc(COST_PER_UOM)) %>%
  head(3)
## NAME
## <char>
## 1: UNITS A, D, C
## 2: ANTELOPE FLAT VEG MANAGEMENT
## 3: AVINTAQUIN FOSSIL RIDGE LS 21
## ACTIVITY COST_PER_UOM GIS_ACRES
## <char> <num> <num>
## 1: Broadcast Burning - Covers a majority of the unit 44583 496.918
## 2: Rearrangement of Fuels 42185 1477.396
## 3: Thinning for Hazardous Fuels Reduction 39997 666.839
## date_completed PROC_REGION_CODE SUID.y
## <Date> <int> <char>
## 1: 2023-05-03 2 021503AVALANCHE2000
## 2: 2020-10-05 4 040101040101AFVM000
## 3: 2020-10-16 4 040104F4AVINTAFO001
Here we have a broadcast burn near Aspen, Colorado; a lop-and-scatter near Flaming Gorge, Utah; and a thinning treatment in central Utah.
The broadcast burn near Aspen is co-located with another broadcast burn (UNIT D), which cost only $62 per acre.
If we assume the “COST_PER_UOM” field was entered as the total cost, then the cost per acre works out to a far more reasonable $90. That is at least within the ballpark of plausible values, and my guess is that it’s the correct figure.
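Spelled out, using the figures from the record above:

# if $44,583 was really the total cost of the 496.918-acre burn
44583 / 496.918
## [1] 89.71903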
The lop-and-scatter near Flaming Gorge is the only recent treatment of its type in the area, but the calculated total cost of $62 million is roughly 5% of the total spend in the Ashley National Forest from 2020-2024.
I can’t tell whether this is a data entry error or a tragic use of taxpayer money. I’m hoping it’s an error, but it doesn’t look like the typical mistake of entering the total cost as COST_PER_UOM: that would mean the real cost per acre was just $29, which is only about 1/2 to 1/3 of the “typical” cost.
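For reference, the arithmetic behind those two figures:

# implied total cost of the Flaming Gorge lop-and-scatter
42185 * 1477.396
## [1] 62323950
# the per-acre rate if the total had been entered as COST_PER_UOM
42185 / 1477.396
## [1] 28.55362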
Finally, there’s a 666-acre thinning treatment in central Utah that cost $26 million: