Purpose: examine possible errors in FACTS cost data

An important function of the TWIG database is to allow users to assess the costs of fuel treatments across broad landscapes. However, whether you take FACTS Hazardous Fuels or FACTS Common Attributes as your source, a quick glance at the raw data reveals some odd values.

This is troublesome, especially for the Area Summary reports. If a record contains one large error, it skews the totals for every Area Summary report that includes it, and given how frequent these errors are, most state-level reports are likely to be substantially inaccurate.
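To make the stakes concrete, here is a minimal sketch with made-up numbers showing how a single mis-keyed cost can dominate an area-wide total:

# Hypothetical example: five treatments at roughly $300/acre on 1,000 acres
# each, plus one record whose per-acre cost was mis-keyed by a factor of 1,000
cost_per_acre <- c(300, 250, 320, 280, 310, 300000)
acres         <- c(1000, 1000, 1000, 1000, 1000, 1000)

sum(cost_per_acre * acres)          # total with the bad record: $301,460,000
sum((cost_per_acre * acres)[-6])    # total without it: $1,460,000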

Examples of bad data

For example, check out this polygon with a cost of $821k per acre:

most_expensive <- hf %>% arrange(desc(COST_PER_UOM)) %>% head(1)

most_expensive %>% 
  select(USFS_REGION = PROC_REGION_CODE, NAME, COST_PER_UOM, UOM = UOM.x, 
         GIS_ACRES = GIS_ACRES.y) %>% 
  knitr::kable()
| USFS_REGION | NAME             | COST_PER_UOM | UOM   | GIS_ACRES |
|------------:|:-----------------|-------------:|:------|----------:|
|           5 | CAMINO CIELO CDZ |       821228 | ACRES |   869.059 |
Most expensive polygon

Yes, Santa Barbara is expensive, but not that expensive. In this case, the record denotes a planned treatment with no completion date. There is a very similar record that does have a completion date, with a far more reasonable cost of $300 per acre.

The “real” polygon has a lower cost of $300 per acre
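To pull up both versions of the record side by side, something like the following works (assuming the planned and completed versions share the same NAME value):

# both the planned (no completion date) and completed Camino Cielo records
hf %>% 
  filter(NAME == "CAMINO CIELO CDZ") %>% 
  select(NAME, COST_PER_UOM, UOM = UOM.x, GIS_ACRES = GIS_ACRES.y, 
         DATE_COMPLETED = DATE_COMPLETED.y)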

Because the erroneous record has no completion date, it does not feed into the automatically generated reports on the TWIG viewer, e.g. the area summary for Los Padres NF; I believe the reports only include records with a completion date in the specified range. Because this particular record is excluded, the error is somewhat less serious from a reporting perspective, but it is still a problem for anyone who wants to download the data and do their own analysis.
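If that is right, the report-eligible subset can be approximated with a filter like the one below. The fiscal-year window is just an example, and this is my guess at the report logic rather than the actual TWIG query:

# records a hypothetical FY2023 report would pick up, assuming reports only
# count treatments completed within the requested date range
report_window <- hf %>% 
  mutate(date_completed = as.Date(DATE_COMPLETED.y, format = "%m/%d/%Y")) %>% 
  filter(!is.na(date_completed),
         date_completed >= as.Date("2022-10-01"),
         date_completed <= as.Date("2023-09-30"))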

However, there are errors in completed treatments, too - and these are affecting Area Summary reports. For example, check out this polygon in southern Utah:

most_expensive_completed <- hf %>% 
  filter(!is.na(DATE_COMPLETED.y),
         # %ni% is "not in", i.e. Negate(`%in%`); drop activity code 1117 (Wildland Fire Use)
         ACTIVITY_CODE.y %ni% c(1117)) %>%
  arrange(desc(COST_PER_UOM)) %>% head(1)

most_expensive_completed %>%
  select(USFS_REGION = PROC_REGION_CODE, NAME, COST_PER_UOM, UOM = UOM.x, 
         GIS_ACRES = GIS_ACRES.y, DATE_COMPLETED = DATE_COMPLETED.y) %>% 
  knitr::kable()
| USFS_REGION | NAME                             | COST_PER_UOM | UOM   | GIS_ACRES | DATE_COMPLETED |
|------------:|:---------------------------------|-------------:|:------|----------:|:---------------|
|           4 | GUM HILL CHAINING MAINTENANCE 22 |       405451 | ACRES |  1728.946 | 2023-02-01     |
Most expensive completed polygon

This pinyon-juniper (PJ) chaining treatment supposedly cost $405k per acre, just to drag a chain across the ground. This looks like a case where the total cost was entered instead of the cost per acre.

most_expensive_completed$COST_PER_UOM / most_expensive_completed$GIS_ACRES.y
## [1] 234.5076

A cost of $234 per acre seems far more reasonable, and it would put the total cost at about $405k, i.e. the very figure that was entered in the per-acre field. Instead, the record as it stands implies a total cost of:

most_expensive_completed$COST_PER_UOM * most_expensive_completed$GIS_ACRES.y
## [1] 701002885

Yes, that’s $701 million.

This error is feeding into the automatically generated report for the Dixie National Forest (Dixie NF report).

The treatment is there, and the report lists the total spend as $749 million. The report also lists the exact BIL funding as $700,915,411, which is within rounding error of $701,002,885. Just to double-check that the small discrepancy really is a rounding artifact, I downloaded the data straight from TWIG and calculated the costs again.

dixie <- fread("treatment_index_facts_nfpors_selected.csv")

# isolate the completed, BIL funded treatments (there's only one, the one in
# question)
bil <- dixie %>% filter(fund_source == "BIL", 
                        date_source == "date_completed")

# compare original HAZARDOUS FUELS area with TWIG area
most_expensive_completed$GIS_ACRES.y
## [1] 1728.946
bil$acres
## [1] 1728.9
# compare original HAZARDOUS FUELS cost per acre with TWIG cost per acre
most_expensive_completed$COST_PER_UOM
## [1] 405451
bil$cost_per_uom
## [1] 405451
# compare total costs
most_expensive_completed$COST_PER_UOM * most_expensive_completed$GIS_ACRES.y
## [1] 701002885
bil$cost_per_uom * bil$acres
## [1] 700984234

So, there you go, case closed: TWIG just rounds the acreage. How bad is this error? The total spend in the report is $749 million, and we know roughly $701 million of that is likely erroneous. So, the reported total is off by … roughly 1,400%.
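For the record, here is the back-of-the-envelope arithmetic behind that figure (a sketch assuming the rest of the report’s spend is accurate and the chaining treatment really cost about $234 per acre):

reported_total   <- 749e6        # total spend shown in the Dixie NF report
bad_record       <- 701e6        # total implied by the $405k/acre entry
plausible_record <- 234 * 1729   # ~$234/acre over ~1,729 acres, about $405k

corrected_total <- reported_total - bad_record + plausible_record
corrected_total                                        # roughly $48.4 million
(reported_total - corrected_total) / corrected_total   # about 14.5, i.e. ~1,400-1,500% too high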

Identifying likely errors

Is it possible to know for sure which values are errors? No, but we can probably assume that the most extreme values are accidental typos. But what counts as “extreme”? To answer that, we need to understand the full distribution of cost per acre values.

Note: The data does not include a “cost per acre” field, but rather a “cost per unit” field (COST_PER_UOM). The unit is usually “per acre”, but sometimes it is “per each”, and sometimes it is unspecified (NA). So first we need to make sure we are dealing with consistent units.
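A quick tally of the unit-of-measure field shows how often each unit appears:

# how many records use each unit of measure
hf %>% count(UOM.y, sort = TRUE)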

hf %>% 
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES", 
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 0 & !is.na(COST_PER_UOM)
  ) %>% 
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 1000) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       x = "Cost per acre",
       y = "Count")

There is a long tail of high values. The vast majority of treatments are under $1000 per acre, but there are a few treatments that are much more expensive. The extreme range of values makes the histogram difficult to read, so let’s try breaking it into chunks. First, the range of values from $0 to $5000:

hf %>% 
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES", 
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 0 & 
      COST_PER_UOM < 5000 &
      !is.na(COST_PER_UOM)
  ) %>% 
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 500) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       subtitle = "Values less than $5000 only",
       x = "Cost per acre",
       y = "Count")

Now, for the remaining data (the extreme values):

hf %>% 
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES", 
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 5000 &
      !is.na(COST_PER_UOM)
  ) %>% 
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 500) +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       subtitle = "Values greater than $5000 only",
       x = "Cost per acre",
       y = "Count") +
  geom_vline(xintercept = 50000, color = "red") +
  geom_vline(xintercept = 75000, color = "red") +
  geom_vline(xintercept = 200000, color = "red") +
  annotate("text", x = 58000, y = 140, angle = 90, 
           label = "cost of sod grass, installed", color = "red") +
  annotate("text", x = 82000, y = 100, angle = 90, 
           label = "cost of carpet, installed", color = "red") +
annotate("text", x = 208000, y = 80, angle = 90, 
           label = "cost of NICE carpet, installed", color = "red") 

There is still an extraordinary right skew, but if you look closely, you can see some small bars at the right end of the graph. Those are single records with extremely high values. You might even be able to see that there are, in fact, 20 treatments apparently more expensive than the nicest carpet you can buy at Home Depot ($4.00 per square foot plus a $0.61-per-square-foot installation charge, which works out to about $200k per acre).

nrow(hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > 200000))
## [1] 20

There are 83 treatments that are more expensive than cheap carpet, and 129 treatments that are more expensive than installing sod grass.

nrow(hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > 50000))
## [1] 129
nrow(hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > 75000))
## [1] 83
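For reference, here is the square-footage arithmetic behind those benchmark lines (one acre is 43,560 square feet):

sq_ft_per_acre <- 43560

(4.00 + 0.61) * sq_ft_per_acre   # NICE carpet, installed: ~$200,800 per acre
75000 / sq_ft_per_acre           # the $75k line is ~$1.72/sq ft (cheap carpet)
50000 / sq_ft_per_acre           # the $50k line is ~$1.15/sq ft (sod, installed)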

You don’t even really need to know the distribution of cost data to know that a fuels treatment should probably be cheaper than a lawn. But, just to be safe, let’s examine a few treatments that are on the bubble of being more expensive than sod grass. For this, I’ll look for records that are “otherwise normal”, i.e. not old or brand new, not Wildland Fire Use, and don’t have a bunch of missing fields.

hf %>% 
  # parse date
  mutate(date_completed = as.Date(DATE_COMPLETED.y, format = "%m/%d/%Y")) %>%
  # use some filters to get a few "reasonable" records
  filter(
    UOM.y == "ACRES", 
    COST_PER_UOM < 50000,
    date_completed > as.Date("2020-01-01") &
      date_completed < as.Date("2024-06-01")
  ) %>% 
  select(NAME, ACTIVITY = ACTIVITY.y, COST_PER_UOM, GIS_ACRES = GIS_ACRES.y, date_completed, PROC_REGION_CODE, SUID.y) %>% 
  arrange(desc(COST_PER_UOM)) %>%
  head(3)
##                             NAME
##                           <char>
## 1:                 UNITS A, D, C
## 2:  ANTELOPE FLAT VEG MANAGEMENT
## 3: AVINTAQUIN FOSSIL RIDGE LS 21
##                                             ACTIVITY COST_PER_UOM GIS_ACRES
##                                               <char>        <num>     <num>
## 1: Broadcast Burning - Covers a majority of the unit        44583   496.918
## 2:                            Rearrangement of Fuels        42185  1477.396
## 3:            Thinning for Hazardous Fuels Reduction        39997   666.839
##    date_completed PROC_REGION_CODE              SUID.y
##            <Date>            <int>              <char>
## 1:     2023-05-03                2 021503AVALANCHE2000
## 2:     2020-10-05                4 040101040101AFVM000
## 3:     2020-10-16                4 040104F4AVINTAFO001

Here, we have a broadcast burn near Aspen, Colorado, a lop-and-scatter near Flaming Gorge, UT, and a thinning treatment in Central Utah.

The broadcast burn near Aspen is colocated with another broadcast burn (UNIT D) which cost only $62 per acre.

Broadcast burn near Aspen

If we assume the “COST_PER_UOM” field actually holds the total cost, the implied cost per acre is a more reasonable $90. That is at least within the ballpark of plausible values, and my guess is that it is the correct one.
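Spelling out that arithmetic with the values from the table above:

# if $44,583 were the total cost rather than the per-acre cost
44583 / 496.918   # about $90 per acre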

The lop-and-scatter near Flaming Gorge is the only recent treatment of its type in the area, so there is no obvious sibling record to compare it against, and its implied total cost of $62 million is roughly 5% of the total spend in the Ashley National Forest from 2020 to 2024.

Lop-and-scatter near Flaming Gorge

I can’t tell whether this is a data entry error or a tragic use of taxpayer money. I’m hoping it’s an error, but it doesn’t look like the typical mistake of entering the total cost as COST_PER_UOM: that would imply a real cost of just $29 per acre, roughly a third to a half of the “typical” cost.

Finally, a 666-acre thinning treatment in Central Utah with an implied total cost of roughly $26.7 million:

Thinning treatment in Central Utah

Another nearby thinning, seemingly from the same overarching project, cost just $110 per acre. If we assume the common “total cost for cost-per-acre” error was committed here, the real cost per acre would be $60. That is on the low side, but not unreasonable.

Takeaways

I can’t tell for sure whether these records are erroneous. It could be that the costs skyrocketed because of some accident or damage to equipment. Maybe there was a chainsaw accident, and the USFS paid out to compensate a worker who lost an arm. We really can’t know for sure.

Yet, given my prior knowledge regarding the relative probabilities of horrific accidents and data entry mistakes, I’m going to venture to guess that the vast majority of these high values are data entry mistakes.

Threshold value for error flagging

So, what should the threshold be for flagging a record as a (potential) error? One suggestion is to flag everything more than one standard deviation above the mean. Because of the strong right skew, this flags fewer records than you might think.

hf_summary <- hf %>% 
  filter(UOM.y == "ACRES", COST_PER_UOM > 0) %>%
  summarise(
    mean = mean(COST_PER_UOM),
    sd = sd(COST_PER_UOM),
    threshold = mean + sd
  ) 

hf_summary$threshold
## [1] 3853.774
errors <- hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > hf_summary$threshold) %>% nrow()
errors
## [1] 1805
errors/nrow(hf)
## [1] 0.002865243

By my calculations, this would flag 1,805 records with a cost per acre above roughly $3,854. That is about 0.3% of all hazardous fuels records. I should note that Ben’s numbers differ slightly, with a threshold of $3,307 flagging 2,437 records. The difference is that Ben did not exclude records with a cost per acre of 0 before calculating the mean and standard deviation.
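If that is the only difference, Ben’s threshold should be reproducible by keeping the zero-cost records in the calculation (a sketch; I have not confirmed this is exactly how his number was computed):

# same mean + 1 SD threshold, but without dropping zero-cost records
hf_summary_with_zeros <- hf %>% 
  filter(UOM.y == "ACRES", !is.na(COST_PER_UOM)) %>%
  summarise(
    mean = mean(COST_PER_UOM),
    sd = sd(COST_PER_UOM),
    threshold = mean + sd
  )

hf_summary_with_zeros$threshold
hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > hf_summary_with_zeros$threshold) %>% nrow()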

Another way to use the standard deviation is to assess total costs, by multiplying the cost per acre by the acreage.

hf_summary_acres <- hf %>% 
  filter(UOM.y == "ACRES", COST_PER_UOM > 0) %>% 
  summarise(
    mean = mean(COST_PER_UOM * GIS_ACRES.y, na.rm = TRUE),
    sd = sd(COST_PER_UOM * GIS_ACRES.y, na.rm = TRUE),
    threshold = mean + sd
  )

hf_summary_acres$threshold
## [1] 4785724
errors_acres <- hf %>% filter(UOM.y == "ACRES", COST_PER_UOM * GIS_ACRES.y > hf_summary_acres$threshold) %>% nrow()
errors_acres
## [1] 376
errors_acres/nrow(hf)
## [1] 0.0005968595

This threshold is much more conservative, flagging only 376 records (about 0.06% of all hazardous fuels records). That is because some of the records with very high cost-per-acre values are actually quite small treatments, so their total costs are unremarkable.
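To see that directly, we can pull the records that trip the per-acre threshold but not the total-cost threshold; they should mostly be small polygons:

# flagged by the per-acre rule but not by the total-cost rule
hf %>% 
  filter(UOM.y == "ACRES",
         COST_PER_UOM > hf_summary$threshold,
         COST_PER_UOM * GIS_ACRES.y <= hf_summary_acres$threshold) %>% 
  summarise(n = n(), median_acres = median(GIS_ACRES.y, na.rm = TRUE))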

Another approach would be to flag a certain percentile of records. For example, we could flag the top 1% most expensive records.

hf_summary_percentile <- hf %>% 
  filter(UOM.y == "ACRES", COST_PER_UOM > 0) %>% 
  summarise(
    threshold = quantile(COST_PER_UOM, 0.99, na.rm = TRUE)
  )

hf_summary_percentile$threshold
## [1] 2239.9
errors_percentile <- hf %>% filter(UOM.y == "ACRES", COST_PER_UOM > hf_summary_percentile$threshold) %>% nrow()
errors_percentile
## [1] 4193
errors_percentile/nrow(hf)
## [1] 0.006655936

This flags 4,193 records, or roughly 0.7% of all hazardous fuels records (by definition, 1% of the records that remain after excluding missing or zero cost values).

Yet another way to approach the issue is to look for a natural break in the data, and reason from there.

# 95th percentile of cost per acre, one candidate for a "natural break" threshold
hf_summary_natural <- hf %>% 
  filter(UOM.y == "ACRES", COST_PER_UOM > 0) %>% 
  summarise(
    threshold = quantile(COST_PER_UOM, 0.95, na.rm = TRUE)
  )

hf %>% 
  filter(
    # check that we're looking at a consistent UOM
    UOM.y == "ACRES", 
    # check that we're looking at a valid, non-zero cost
    COST_PER_UOM > 0 & 
      COST_PER_UOM < 5000 &
      !is.na(COST_PER_UOM)
  ) %>% 
  ggplot(aes(x = COST_PER_UOM)) +
  geom_histogram(bins = 500) +
  scale_x_continuous(labels = scales::dollar) +
  # mark the 95th-percentile threshold computed above
  geom_vline(xintercept = hf_summary_natural$threshold, color = "red") +
  labs(title = "Distribution of cost per acre values in Hazardous Fuels data",
       subtitle = "Values less than $5000 only",
       x = "Cost per acre",
       y = "Count")

What threshold sounds reasonable to you? Pick a threshold and let me know what it is!