A Critical Analysis of Oregon State’s Wildfire Hazard Model
Examining the probabilities and performance in forecasting wildfire occurrences
Curated by: Jeremy Kauwe
Date: 2/5/2025
In public policy, models are expected to provide reliable, evidence-based insights that inform critical decisions. They must be rigorously validated and demonstrate predictive performance well above trivial baselines. It is unacceptable for a model—especially one that influences resource allocation and public safety—to perform on par with a random number generator.
Our long-run evaluation of Oregon State’s Wildfire Hazard Model, using 21 years of wildfire data, reveals critical failures that undermine its credibility as a predictive tool: its burn probability output separates fire from no-fire locations no better than a random number generator, roughly 75% of the areas it flags as fire-prone never burned over the study period, and its percentile-based hazard thresholds do not hold up when applied statewide.
These findings raise serious questions about the validity of the methods used to develop and assess the model’s outputs. For a tool intended to guide public policy, the expectation is clear: models must deliver actionable, robust, and clearly superior predictions—not results that mimic random chance. Without fundamental improvements, the Oregon Wildfire Hazard Model risks misinforming decision-makers, misallocating resources, and eroding public trust in wildfire risk assessments.
Wildfire hazard assessment is a critical tool for land management, public policy, and disaster preparedness. These assessments rely on models designed to estimate the likelihood and potential impact of wildfires based on environmental conditions, historical data, and fire behavior simulations. A fundamental expectation of these models is their ability to accurately distinguish between areas historically prone to wildfires and those with little to no fire history.
A key component of wildfire risk assessments is burn probability, which is defined as the average annual likelihood that a specific location will experience wildfire. Burn probabilities are expressed as fractions, where a value of 0.01 represents a 1% chance of fire in any given year, or one expected fire every 100 years on average. These probabilities are long-term averages, not short-term forecasts, and are used alongside fire intensity information to determine which landscapes face greater wildfire hazard.
However, burn probability is only one part of the overall wildfire hazard model. The final wildfire hazard output integrates burn probability with fire intensity modifiers and assigns risk classifications into three designated tiers: low, moderate, and high.
These hazard classifications were initially developed using hazard values near structures within the Wildland-Urban Interface (WUI). Researchers combined burn probability with fire intensity at a 30 x 30-meter pixel scale, averaged the values within a three-cell neighborhood, and extracted 792,949 hazard values associated with structures in the WUI. The hazard thresholds for moderate and high hazard zones were based on the 40th and 90th percentiles of these values, respectively. These thresholds were recommended for adoption by the Rules Advisory Committee in February 2022 and later formally adopted by the Board of Forestry in June 2022.
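To make the threshold derivation concrete, the following is a minimal sketch of the percentile approach described above. The array `wui_hazard_values` and its file path are hypothetical stand-ins for the 792,949 extracted WUI hazard values.

```python
import numpy as np

# Hypothetical array standing in for the 792,949 neighborhood-averaged
# hazard values extracted at structures within the WUI.
wui_hazard_values = np.load("wui_hazard_values.npy")  # placeholder path

# Moderate and high hazard thresholds at the 40th and 90th percentiles.
moderate_threshold, high_threshold = np.percentile(wui_hazard_values, [40, 90])
print(f"Moderate threshold (40th percentile): {moderate_threshold}")
print(f"High threshold (90th percentile):     {high_threshold}")
```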
This report evaluates two critical aspects of the wildfire hazard model: the burn probability component and the final wildfire hazard classifications.
This analysis examines model outputs across the entire state of Oregon, assessing whether the wildfire hazard model meets the necessary standards for accurate classification, predictive reliability, and policy relevance in wildfire risk assessment.
The data used in this analysis comes from multiple sources, including model outputs from Oregon State University, historical wildfire occurrence records, and a transformed dataset created by overlapping these sources. These datasets were obtained through Oregon State’s Wildfire Hazard Risk Point of Contact (POC) and serve as the foundation for evaluating both the burn probability component and the final wildfire hazard classifications of the wildfire hazard model.
This dataset is the result of integrating data from both the Wildfire Hazard Model Data and Historical Fire Data. It combines the key elements necessary for evaluating the wildfire hazard model: burn probability outputs, final wildfire hazard values, and historical fire occurrence records.
Each row in this dataset represents a single pixel where these four data sources overlap. The spatial resolution is 500 feet, with each pixel spaced 500 feet apart both vertically and horizontally, ensuring a uniform grid structure.
To better illustrate the structure and granularity of the dataset, the following figures provide a zoomed-in visualization of the spatial data points. Each chart progressively zooms in to show how the dataset is structured at different levels of detail.
The first chart displays the entire dataset, highlighting the spatial distribution of data points across the region.
This second chart provides a mid-level zoom, showing how individual data points become more distinguishable.
The final chart provides a close-up of a small section of the dataset, emphasizing the 500-foot pixel resolution and illustrating the spatial density of the points.
These figures help to contextualize the dataset, ensuring that readers understand the scale and resolution of the data being analyzed.
This analysis focuses on a subset of the transformed dataset to examine burn probability distributions, fire occurrence patterns, and final wildfire hazard classifications.
By analyzing this dataset, we assess whether burn probability effectively distinguishes between fire-prone and non-fire areas and whether the final wildfire hazard classifications correspond to actual fire history.
This dataset, provided by Oregon State University (OSU), was used to build the wildfire hazard model. It contains burn probability outputs and the final wildfire hazard values, which were extracted for analysis.
Data Source: Oregon State University
Path to Final Burn Probabilities:
SB80PublicData >> FireModelingData >> FireModeling_FuelscapeData.gdb >> BurnProbability
Path to Wildfire Hazards:
SB80PublicData >> FireModelingData >> FireModeling_FuelscapeData.gdb >> WildfireHazard
This dataset provides both the raw burn probability values and the final hazard classifications, which were later overlaid with historical fire records to evaluate their predictive accuracy.
This dataset, provided by Oregon State University (OSU), contains recorded wildfire events from 2000 to 2021 and was used to validate the model’s predictive accuracy.
This dataset was overlaid with the Wildfire Hazard Model Data to create the Transformed Dataset, allowing an evaluation of whether burn probability aligns with historical fire occurrence patterns and whether hazard classifications correspond to actual wildfire risk.
A full list of dataset links and additional details can be found in the GitHub README. For further information or verification, inquiries can be directed to OSUwildfirerisk@oregonstate.edu.
The chart represents probability values using a grayscale gradient, where lower probabilities are darker (black) and higher probabilities are lighter (white).
The wildfire hazard map uses a grayscale gradient to represent hazard values, where lower hazard areas appear darker (black) and higher hazard areas appear lighter (white). This visualization helps distinguish varying levels of wildfire hazard based on model outputs.
The hazard map above provides a visual representation of wildfire risk levels across the study area, highlighting low to high hazard zones using grayscale intensity.
This map overlays the model’s burn probability layer with historical fire occurrences (2000-2021), highlighting areas where the model predicted fire risk versus where fires actually occurred. The gradient coloring is purely cosmetic, added to make the map easier to read.
This section outlines the data preparation steps used to construct the final dataset for analysis.
For each grid point, the following values were extracted:

- Burn Probability Value from the model output.
- Wildfire Hazard Value from the model output.
- Fire Occurrence Flag assigned based on historical wildfire data: 1 if a fire was recorded at that location (after merging overlapping fires), 0 if no fire was recorded.

Wildfire hazard values were categorized into three risk bands:

- High Hazard: hazard value > 0.137872
- Moderate Hazard: hazard value > 0.001911 and ≤ 0.137872
- Low Hazard: hazard value ≤ 0.001911
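The following is a minimal sketch of this preparation step, assuming a hypothetical `transformed_dataset.csv` with columns `burn_probability`, `hazard_value`, and `fire_occurred`; the actual file and column names may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical transformed dataset: one row per 500-ft grid pixel.
df = pd.read_csv("transformed_dataset.csv")  # placeholder path

# Band hazard values using the thresholds listed above. pd.cut uses
# right-closed intervals, matching Low <= 0.001911, Moderate in
# (0.001911, 0.137872], and High > 0.137872.
df["hazard_band"] = pd.cut(
    df["hazard_value"],
    bins=[-np.inf, 0.001911, 0.137872, np.inf],
    labels=["Low", "Moderate", "High"],
)

print(df["hazard_band"].value_counts())
```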
To evaluate how well the model predicts fire occurrences, we conducted several tests. These tests help determine whether the model provides meaningful predictions or if its results are no better than random guessing. Below, we explain these tests in simple terms and why they matter.
Predicting fire occurrences requires a balance between two key factors: precision and recall. Additionally, we use the Precision-Recall Curve and its Area Under the Curve (PR-AUC) to assess the model’s overall effectiveness.
When predicting fire occurrences, we measure how well the model identifies fire-prone areas using two key metrics: precision (the share of locations predicted to burn that actually burned) and recall (the share of locations that burned that the model successfully flagged).
We need a balance, which leads us to the Precision-Recall Curve.
Since precision and recall change depending on the model’s cut off, we plot a Precision-Recall Curve to analyze its performance across different thresholds.
Since PR-AUC is particularly useful for imbalanced datasets, it is important to compare model performance to known industry standards.
To test whether the model is actually predicting fires or just guessing, we compared it to a random number generator that assigns probabilities at random.
To further evaluate the model, we use histogram overlays and box and whisker plots to compare probability distributions for locations where fires occurred and where they did not.
This boxplot and histogram analysis helps determine whether the model assigns meaningfully different probabilities to fire-prone areas or if probability values are too similar across all locations, limiting the model’s usefulness.
To quantify how much the model overestimates fire risk, we calculate the overestimation rate, which represents the proportion of predicted fires that never actually occurred.
\[ \text{Overestimation Rate} = \frac{\text{False Positives (FP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \]
This value represents the percentage of predicted fires that did not happen.
Since Precision is defined as:
\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \]
We can equivalently express overestimation as:
\[ \text{Overestimation Rate} = 1 - \text{Precision} \]
This shows that as precision improves, the overestimation rate decreases, meaning fewer false alarms.
We start with the definition of Precision:
\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \]
Now, subtract Precision from 1:
\[ 1 - \text{Precision} = 1 - \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Express 1 as a fraction with the same denominator:
\[ 1 - \text{Precision} = \frac{\text{TP} + \text{FP}}{\text{TP} + \text{FP}} - \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Since both terms have the same denominator, we subtract the numerators:
\[ 1 - \text{Precision} = \frac{(\text{TP} + \text{FP}) - \text{TP}}{\text{TP} + \text{FP}} \]
Simplify the numerator:
\[ 1 - \text{Precision} = \frac{\text{FP}}{\text{TP} + \text{FP}} \]
But this is exactly the formula for the Overestimation Rate:
\[ \text{Overestimation Rate} = \frac{\text{False Positives (FP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \]
Thus, we have proved:
\[ 1 - \text{Precision} = \text{Overestimation Rate} \]
By measuring overestimation, we assess whether the model exaggerates wildfire likelihood, ensuring its predictions are accurate and not misleading.
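As a quick numeric check with hypothetical counts: suppose the model flags 100 locations as fire-prone and 25 of them actually burn (TP = 25, FP = 75). Then:

\[ \text{Precision} = \frac{25}{25 + 75} = 0.25, \qquad \text{Overestimation Rate} = 1 - 0.25 = \frac{75}{100} = 0.75 \]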
To assess wildfire risk across different hazard bands, we normalize the number of pixels that experienced fire by the total number of pixels within each risk band. This provides a measure of fire occurrence per pixel, allowing for a standardized comparison across hazard classifications.
\[ \text{Relative Risk} = \frac{\text{Number of Pixels That Experienced Fire in Risk Band}}{\text{Total Pixels in Risk Band}} \]
Since the hazard model predicts that higher risk bands should experience more fires, we measure overestimation as:
\[ \text{Overestimation of Hazard Risk} = 1 - \text{Relative Risk} \]
By applying this method, we create a standardized metric to evaluate whether the model realistically reflects wildfire occurrence across different hazard classifications.
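Continuing the data-preparation sketch above (with the hypothetical `hazard_band` and `fire_occurred` columns), the per-band relative risk and overestimation can be computed as:

```python
# `df` carries `hazard_band` and `fire_occurred` per the earlier sketch.
relative_risk = df.groupby("hazard_band", observed=True)["fire_occurred"].mean()
overestimation = 1 - relative_risk

print(relative_risk)   # share of pixels in each band that burned (2000-2021)
print(overestimation)  # 1 - relative risk, per the definition above
```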
A percentile represents the relative standing of a value within a dataset, indicating the percentage of data points that fall below it. For example, a value at the 90th percentile means that 90% of the data falls below it, while a value at the 40th percentile means only 40% of the data falls below it. In a well-calibrated model, the hazard thresholds derived from the full dataset should align closely with those established using the WUI subset. If these thresholds differ significantly—such as the high hazard threshold corresponding to the 90th percentile in the WUI subset but only the 40th percentile in the full dataset—it suggests a misalignment in how hazard values are distributed across different data groupings. Ideally, a properly functioning model should produce similar thresholds regardless of whether they are derived from the WUI subset or the full dataset, ensuring consistency in hazard classification.
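A minimal sketch of this consistency check, assuming hypothetical arrays of WUI and statewide hazard values:

```python
import numpy as np

# Hypothetical inputs: hazard values at WUI structures and statewide.
wui_values = np.load("wui_hazard_values.npy")       # placeholder path
statewide_values = np.load("statewide_hazard.npy")  # placeholder path

# High-hazard threshold as defined on the WUI subset.
high_threshold = np.percentile(wui_values, 90)

# Percentile rank of that same threshold in the statewide distribution;
# a well-calibrated model would keep this rank near 90.
statewide_rank = (statewide_values < high_threshold).mean() * 100
print(f"WUI 90th-percentile threshold falls at the "
      f"{statewide_rank:.0f}th percentile statewide")
```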
The burn probability model is designed to classify areas based on the likelihood of fire occurrence. Since our goal is to assess how well it differentiates between fire and no-fire events, we evaluate it as a classification model.
A key challenge in this analysis is the imbalance in the dataset, where fire events are much rarer than no-fire events: only 21.89% of pixels experienced fire over the 21-year study period.
For imbalanced datasets, standard metrics like accuracy can be misleading, since a model could predict “No Fire” for all pixels and still achieve high accuracy. Instead, we focus on Precision-Recall AUC (PR AUC) because it centers on the rare positive class: it rewards correctly identifying fire pixels and is not inflated by the overwhelming number of correctly predicted no-fire pixels.
To evaluate the Burn Probability Model (OSU’s model), we compute the Precision-Recall AUC (PR AUC) score on the model’s probability output, trimming off the first and last data points of the curve.
To understand whether the model is better than random guessing, we compare its performance to a random probability generator.
When a classification model performs no better than random, its performance simply reflects the balance of the dataset. Since fires occur in 21.89% of the dataset, a completely random model’s PR AUC should be approximately 0.22.
To determine if the OSU burn probability model provides meaningful classification, we generate random probabilities and calculate the PR AUC for these random scores.
To validate this comparison, we:

1. Generate random probabilities sampled from a uniform distribution between 0 and 1.
2. Compute PR AUC for these random probabilities.
3. Compare the OSU model’s PR AUC to this random baseline.
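A minimal sketch of this baseline comparison, reusing the hypothetical dataset from the preparation step (column names are assumptions; `average_precision_score` is scikit-learn’s standard PR-AUC estimate):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

# Hypothetical transformed dataset (see the data-preparation sketch).
df = pd.read_csv("transformed_dataset.csv")  # placeholder path
y_true = df["fire_occurred"].to_numpy()
model_scores = df["burn_probability"].to_numpy()

# Random baseline: uniform probabilities on [0, 1).
rng = np.random.default_rng(seed=0)
random_scores = rng.uniform(0.0, 1.0, size=len(y_true))

print("OSU model PR AUC:   ", average_precision_score(y_true, model_scores))
print("Random model PR AUC:", average_precision_score(y_true, random_scores))
print("Prevalence baseline:", y_true.mean())  # expected PR AUC of a random model
```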
If both values are similar, then the OSU model is functionally random and does not improve fire risk classification.
| Model | PR AUC Score |
|---|---|
| OSU Burn Probability Model | 0.22 |
| Random Probability Model | 0.22 |
This confirms that the burn probability model does not provide any additional value over a purely random model.
The Precision-Recall AUC comparison shows that the Burn Probability Model does not meaningfully distinguish fire from non-fire events.
These results indicate that the Burn Probability Model does not provide actionable fire risk predictions. If a model’s classification power is no better than random, then its outputs cannot be relied upon for decision-making.
To assess whether the model effectively differentiates between fire-prone and non-fire areas, we examine histogram overlays and box and whisker plots of assigned burn probabilities.
Each pixel in the dataset is assigned a burn probability between 0 and 1, indicating how likely the model believes a fire will occur at that location. If the model is performing well, there should be clear separation between fire and no-fire probability distributions.
A well-calibrated classification model should assign higher probabilities to fire-prone areas while keeping non-fire probabilities lower. That means the fire and no-fire probability distributions should peak in clearly different ranges, with minimal overlap between them.

When visualized, this separation would appear as two distinct distributions.
The figures below display both histogram overlays and boxplots comparing fire vs. no-fire probabilities.
The histogram overlays the burn probability distributions for both fire and no-fire pixels.
The boxplot visualizes the spread and median values of fire and no-fire probabilities.
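A minimal sketch of these two visualizations, continuing with the hypothetical `df` from the preparation step:

```python
import matplotlib.pyplot as plt

# Split burn probabilities by the fire-occurrence flag.
fire = df.loc[df["fire_occurred"] == 1, "burn_probability"]
no_fire = df.loc[df["fire_occurred"] == 0, "burn_probability"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram overlay: a well-separated model would show two distinct peaks.
ax1.hist(no_fire, bins=50, density=True, alpha=0.5, label="No fire")
ax1.hist(fire, bins=50, density=True, alpha=0.5, label="Fire")
ax1.set_xlabel("Burn probability")
ax1.set_ylabel("Density")
ax1.legend()

# Boxplots of the same two distributions.
ax2.boxplot([no_fire.to_numpy(), fire.to_numpy()])
ax2.set_xticklabels(["No fire", "Fire"])
ax2.set_ylabel("Burn probability")

plt.tight_layout()
plt.show()
```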
Since the burn probability model provides continuous probability estimates, we must set a threshold to classify whether a fire is predicted to occur.
We selected the threshold at which precision reaches its maximum on the precision-recall curve, designating all probabilities at or above this value as fire and all probabilities below it as no fire. We chose this threshold because precision peaks near the baseline and stays there; a minimal sketch of the selection follows.
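This sketch assumes the `y_true` and `model_scores` arrays from the PR-AUC comparison above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, model_scores)

# precision/recall contain one more entry than thresholds (a final point
# at recall = 0, precision = 1 by convention); drop it so the index
# aligns with `thresholds`.
best_idx = int(np.argmax(precision[:-1]))
best_threshold = thresholds[best_idx]

print(f"Threshold at maximum precision: {best_threshold}")
print(f"Precision: {precision[best_idx]:.4f}, recall: {recall[best_idx]:.4f}")
```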
At the selected threshold of 0.001810912741348, precision is 0.2534851761282108, and the model classifies fire risk as follows:
| Metric | Value |
|---|---|
| True Negatives (TN) | 2,963,665 |
| True Positives (TP) | 2,132,259 |
| False Positives (FP) | 6,279,511 |
| False Negatives (FN) | 458,360 |
| Precision | 0.25 |
| Recall | 0.82 |
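As a quick consistency check, the reported precision and recall can be recomputed directly from these counts:

```python
# Confusion-matrix counts from the table above.
tp, fp, fn = 2_132_259, 6_279_511, 458_360

precision = tp / (tp + fp)      # ~= 0.2535
recall = tp / (tp + fn)         # ~= 0.8231
overestimation = 1 - precision  # ~= 0.7465

print(f"precision={precision:.4f}, recall={recall:.4f}, "
      f"overestimation={overestimation:.4f}")
```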
One of the most concerning findings is the high rate of false positives, leading to an overestimation of fire risk.
We calculate the Overprediction Rate as:
\[ \text{Overprediction Rate} = \frac{\text{False Positives (FP)}}{\text{False Positives (FP)} + \text{True Positives (TP)}} \]
At the selected threshold:
With FP = 6,279,511 and TP = 2,132,259, the overprediction rate is 6,279,511 / (6,279,511 + 2,132,259) ≈ 74.65%. This means that over the long run (21 years of data), nearly 75% of areas classified as high fire risk did not actually experience a fire.
The model’s severe overestimation of wildfire risk has direct consequences that affect both resource allocation and public trust.
To assess the degree of overestimation in the wildfire hazard classifications, we compute the relative risk for each hazard band (high, moderate, and low). The relative risk is defined as the proportion of pixels within a given hazard band that have experienced fire over the 21-year study period:
\[ \text{Relative Risk} = \frac{\text{Pixels Experiencing Fire}}{\text{Total Pixels in Hazard Band}} \]
A well-calibrated model should produce relative risks that closely match the expected hazard levels. However, if a hazard classification systematically overstates fire probability, this will be evident in a large discrepancy between the assigned hazard category and the actual historical fire occurrence.
To quantify overestimation, we compute:
\[ \text{Overestimation} = 1 - \text{Relative Risk} \]
where values closer to 1 indicate a high degree of overprediction (i.e., many pixels are classified as high hazard despite rarely experiencing fire).
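For example (illustrative numbers only): if just 10% of the pixels in a band labeled high hazard burned over the study period, then:

\[ \text{Relative Risk} = 0.10, \qquad \text{Overestimation} = 1 - 0.10 = 0.90 \]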
The results over the 21-year dataset reveal a consistent pattern of overestimation across all hazard bands.
These findings suggest that wildfire risk is persistently overstated, especially in areas labeled as “high hazard.” This long-run analysis raises concerns about the accuracy of hazard classifications and their implications for policy and resource allocation.
The figure below illustrates the distribution of actual fire occurrences across different hazard bands, visually demonstrating the overestimation trend.
The results indicate that fire risk classification is highly inflated, particularly in high hazard zones, where fire occurrence is only marginally higher than in moderate zones. A well-calibrated model should show clear progression in risk between tiers, yet the lack of separation suggests that hazard bands may not accurately reflect true fire probabilities. Future adjustments to risk thresholds should ensure that each tier corresponds to a meaningful increase in fire occurrence, avoiding unnecessary overclassification that could misallocate resources and policy efforts.
The wildfire hazard model assumes that areas within the Wildland-Urban Interface (WUI) are the most hazardous due to their proximity to human development and fire-prone landscapes. If this assumption holds, hazard scores in the WUI should generally be higher than those in the full dataset, but the percentile thresholds used for classification should remain relatively stable when applied beyond the WUI subset. Comparing the 40th and 90th percentiles—which were explicitly used to define hazard categories—allows us to test whether these thresholds scale appropriately across datasets.
The comparison reveals a significant misalignment: the hazard value at the 90th percentile of the WUI subset corresponds to only about the 40th percentile of the full statewide dataset, so roughly 60% of statewide values exceed the “high hazard” threshold.
These findings indicate that hazard classifications based on the WUI subset may not generalize well to broader landscapes. A well-calibrated model should maintain stable thresholds across different datasets, ensuring that hazard classifications retain their intended meaning. The fact that the full dataset’s 40th percentile aligns with the WUI subset’s 90th percentile suggests that the threshold selection process led to substantial overclassification of wildfire risk when applied statewide.
Wildfire risk models are intended to provide actionable, data-driven insights that guide public policy, emergency planning, and resource allocation. Their credibility depends on their ability to separate fire-prone areas from those unlikely to burn. However, this evaluation of the Oregon Wildfire Hazard Model highlights fundamental shortcomings, including overclassification, weak hazard differentiation, threshold misalignment, and poor predictive performance.
→ Even in a probabilistic framework, a well-calibrated model should assign higher probabilities to areas that actually burned. If a random number generator performs equally well, the model isn’t providing meaningful risk estimates.
→ If the model fails within its own training data, why should we trust it for long-term predictions? A good model should show at least some predictive power over shorter time frames.
→ While class imbalance affects PR AUC, a useful model should still outperform random chance. The fact that a random number generator produced the same results suggests the model is not learning meaningful patterns.
→ Even if wildfire occurrence has stochastic elements, a useful model should still rank high-risk areas above low-risk ones. The model’s failure to do so suggests it is not capturing meaningful fire risk patterns at all.
→ If fire spread is inherently unpredictable at the property level, then the model should not be used to inform micro-level policies such as tax lot classifications. A model that cannot reliably distinguish risk at fine scales should not drive mitigation requirements, insurance policies, or property regulations.
→ If all models struggle this much, perhaps we need a different modeling approach altogether. A simpler statistical model might perform just as well or better while being more interpretable.
A primary concern is the significant inflation of fire risk classifications. The model assigns high hazard labels to nearly 60% of tax lots, even though the original intent was to classify only the highest-risk 10%.
One of the most concerning findings is that the model does not effectively separate areas that have historically burned from those that have not.
This analysis strongly suggests that alternative modeling approaches—including simpler, better-calibrated models or different feature selection methods—are necessary to ensure wildfire risk assessments provide useful, actionable, and transparent insights.
The findings of this evaluation raise serious concerns about the reliability and applicability of the Oregon Wildfire Hazard Model. Despite being designed to inform policy decisions, emergency planning, and wildfire mitigation strategies, the model fails to distinguish between historically burned and non-burned areas—even on the data it was trained on. Typically, models perform better on their training data than on unseen data, but this model fails even in that setting, suggesting it is not learning meaningful wildfire risk patterns at all.
The model’s overclassification of high-risk areas, weak differentiation between hazard tiers, and contradictions in its Wildland-Urban Interface (WUI) assumptions further undermine its credibility. The fact that a random number generator produced equivalent predictive performance highlights a fundamental flaw: the model adds no real predictive value beyond chance. If a hazard model cannot provide reliable and actionable predictions, then its use for tax lot-level policies, insurance decisions, and mitigation planning is not just ineffective but potentially harmful.
The consequences of using an unreliable model extend beyond poor predictions—misclassified risk scores could impose unnecessary financial burdens on property owners, misallocate state and federal resources, and undermine public trust in fire mitigation policies. Moreover, if wildfire spread is inherently too stochastic to be accurately predicted at the property level, then using such a model to drive micro-level policy decisions—such as tax lot regulations—is not just ineffective but unjustifiable. A model that lacks predictive accuracy at fine scales should not be used to impose regulatory or financial consequences on property owners and communities.
Given these issues, it is imperative that alternative modeling approaches be considered. Future work should focus on simpler, better-calibrated statistical models, transparent feature selection, and improved validation techniques to ensure that wildfire risk assessments provide useful, actionable, and scientifically sound insights. Without these improvements, decision-makers risk enacting policies that do more harm than good—placing unnecessary burdens on communities while failing to improve wildfire preparedness and mitigation.