Predicting Marshall County Crop Loss Using Historical Weather Data
Bethany Poulin
Predicting Marshall County Crop Loss Using Historical Weather Data
Experimental Crop Damage Model
Motivation for Analysis
- Insurance Companies Spend Millions of Dollars Adjusting Crop Loss Each Year
- In Extreme Weather Events, Smaller Claims Take a Back Seat to Larger More Pressing Claims
- Cost of Adjusting Small Claims Might be Greater Than Settling Quickly with Acurate Estimate
- A Predictive Model, Geographically Identified, With Reasonable Predictive Power Great Asset
Practical Performance Guidelines
- Accuracy in Identifying a True Loss
- Reasonably Accurate Loss Estimate
- Realtime Geolocated Predictions
Assumptions
The utility of this model is predicated on the assumption that time, money and manpower delay the adjustment of crop claims, that larger claims get more attention and receive a higher priority than smaller claims, leaving them left unsettled by insurers for too long.
Choosing a Geography to Analyze
Iowa was chosen as the state to look in for a county to use as the test geography for four major reasons:
- Heavily agricultural state with vast areas of wheat and corn, which are prone to weather related damage
- Consistent topography with moderate elevation changes, which facilitates the building and testing of a small, scale model
- Crops with predictable and consistent yields (corn and wheat are easier to estimate than say apples or peaches)
- An ecosystem prone to extreme weather events from freezing cold, to droughts, to tornadoes and hail.
Within Iowa, Marshall county became the choice because it is a highly agricultural county with known historical crop-loss events that vary from year to year and storm to storm.
# data for maps is included in package
states <- map_data("state")
iowa_geo <- subset(states, region == "iowa")
counties <- map_data("county")
iowa_county <- subset(counties, region == "iowa")
marshall <- subset(iowa_county, subregion == "marshall")
base <- ggplot(data = iowa_geo, mapping = aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) + geom_polygon(data = iowa_county, fill = NA, color = "white") +
geom_polygon(color = "black", fill = NA) + geom_polygon(data = marshall,
fill = "#a65c85b2", color = "white")
base
Marshall County
Data Sources & Acquisition
In order to do meaningful analysis of infrequent events, it was necessary to acquire as much weather and event data as possible, while at the same time maintaining the level of granularity in both sets.
National Oceanographic Atmospheric Administration’s Storm Events Data - via R package ‘noaastormevents’(directly acceses data)
National Oceanographic Atmospheric Administration’s Climate Data Online - in a CSV with data from 01-01-1999 to 12-31-2010, for Marshall County, Iowa
Reading Data from NOAA Storm Events Database
The data comes in with 15 fields:
begin_date | end_date | state | cz_type | cz_name | event_type | source | injuries_direct | injuries_indirect | deaths_direct | deaths_indirect | damage_property | damage_crops | fips |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1999-01-01 | 1999-01-01 | Louisiana | C | Bossier | Tornado | NWS STORM SURVEY | 0 | 0 | 0 | 0 | 60000 | 0 | 22015 |
1999-01-01 | 1999-01-01 | Louisiana | C | Webster | Tornado | NWS STORM SURVEY | 0 | 0 | 0 | 0 | 88000 | 0 | 22119 |
1999-01-01 | 1999-01-01 | Texas | C | Trinity | Hail | LAW ENFORCEMENT | 0 | 0 | 0 | 0 | 5000 | 0 | 48455 |
1999-01-01 | 1999-01-01 | Texas | C | Montgomery | Tornado | TRAINED SPOTTER | 0 | 0 | 0 | 0 | 75000 | 0 | 48339 |
1999-01-01 | 1999-01-01 | Texas | C | Liberty | Tornado | NWS STORM SURVEY | 0 | 0 | 0 | 0 | 175000 | 0 | 48291 |
1999-01-01 | 1999-01-01 | Texas | C | Walker | Tornado | LAW ENFORCEMENT | 0 | 0 | 0 | 0 | 15000 | 0 | 48471 |
Comparing Weather Driven Crop Loss Across Iowa
From year to year loss events differ in numbers and degree. The following maps are here to give a sense of how variable weather related crop loss can be in Iowa.
2001 Crop Damage & Number of Storm Events
map_events(Iowa_events_01, states = "Iowa", plot_type = "crop damage")
map_events(Iowa_events_01, states = "Iowa", plot_type = "number of events")
2005 Crop Damage & Number of Storm Events
map_events(Iowa_events_05, states = "Iowa", plot_type = "crop damage")
map_events(Iowa_events_05, states = "Iowa", plot_type = "number of events")
Marshall County Events Data
Because this is our target data for the model, all that was kept were the starting and ending event dates, as well as the crop damages for each of the storms reported for Marshall County, Iowa.
begin_date | end_date | damage_crops |
---|---|---|
1999-01-01 | 1999-01-02 | 0 |
1999-02-11 | 1999-02-11 | 0 |
1999-03-08 | 1999-03-08 | 0 |
1999-03-17 | 1999-03-17 | 0 |
1999-04-06 | 1999-04-15 | 0 |
1999-04-08 | 1999-04-08 | 0 |
1999-04-08 | 1999-04-08 | 0 |
1999-04-22 | 1999-04-27 | 0 |
1999-05-10 | 1999-05-10 | 0 |
1999-05-16 | 1999-05-16 | 0 |
1999-05-16 | 1999-05-20 | 25000 |
1999-05-21 | 1999-05-25 | 10000 |
1999-06-04 | 1999-06-04 | 5000 |
1999-06-08 | 1999-06-08 | 3000 |
1999-06-09 | 1999-06-15 | 75000 |
Reading in the Weather Data
There were multiple sources of daily weather data within NOAA, my chosen one was easily accessible at the county level using the R package countyweather
.
However, at the time of the analysis, the package was experiencing technical difficulties accessing granular data to convert to a county level 15-minute intervals for a lagged linear model.
Instead daily averages were acquired in csv format, using the Climate Data Search Tool.
Weather Data Frame
TOBS | TMIN | TMAX | SNWD | SNOW | PRCP | DATE | ELEVATION | LONGITUDE | LATITUDE |
---|---|---|---|---|---|---|---|---|---|
7 | -10 | 8 | 3 | 0 | 0.00 | 1999-01-01 | 265.2 | -92.9244 | 42.0647 |
16 | 7 | 16 | 9 | 6 | 0.73 | 1999-01-02 | 265.2 | -92.9244 | 42.0647 |
9 | 9 | 18 | 12 | 3 | 0.20 | 1999-01-03 | 265.2 | -92.9244 | 42.0647 |
-12 | -12 | 9 | 8 | 0 | 0.00 | 1999-01-04 | 265.2 | -92.9244 | 42.0647 |
-10 | -26 | -7 | 8 | 0 | 0.00 | 1999-01-05 | 265.2 | -92.9244 | 42.0647 |
21 | -10 | 21 | 8 | 0 | 0.02 | 1999-01-06 | 265.2 | -92.9244 | 42.0647 |
Summary Statistics of Weather Variables
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TOBS | 1 | 4378 | 41.9641389 | 20.3754533 | 44 | 43.0724886 | 23.7216 | -29 | 81.00 | 110.00 | -0.4111246 | -0.5193553 | 0.3079423 |
TMIN | 2 | 8726 | 37.3964016 | 20.9207917 | 38 | 38.5012890 | 23.7216 | -34 | 78.00 | 112.00 | -0.3830204 | -0.6182564 | 0.2239600 |
TMAX | 3 | 8725 | 58.6043553 | 22.3965865 | 62 | 59.9469990 | 26.6868 | -10 | 99.00 | 109.00 | -0.4305054 | -0.8745963 | 0.2397724 |
SNWD | 4 | 8262 | 0.4852336 | 1.8209297 | 0 | 0.0045386 | 0.0000 | 0 | 26.00 | 26.00 | 5.0385890 | 31.7862020 | 0.0200332 |
SNOW | 5 | 9776 | 0.0721358 | 0.5007687 | 0 | 0.0000000 | 0.0000 | 0 | 13.00 | 13.00 | 10.2348102 | 139.0760720 | 0.0050647 |
PRCP | 6 | 14450 | 0.0975612 | 0.3097190 | 0 | 0.0204663 | 0.0000 | 0 | 4.83 | 4.83 | 5.7573139 | 47.1445766 | 0.0025765 |
Combining Data for Analysis
Sets are combined to facilitate a linear regression model intended to predict the dollar value of loss associated with each storm event.
This allows supervised methods to access predictors and outcome variables in a single data frame
marshall_whole <- left_join(marshall_all_weather, marshall_events, by = c(DATE = "end_date"))
marshall_whole <- marshall_whole %>% mutate(damage_crops = round(damage_crops,
0))
This is the resulting set:
TOBS | TMIN | TMAX | SNWD | SNOW | PRCP | DATE | ELEVATION | LONGITUDE | LATITUDE | damage_crops |
---|---|---|---|---|---|---|---|---|---|---|
7 | -10 | 8 | 3 | 0 | 0.00 | 1999-01-01 | 265.2 | -92.9244 | 42.0647 | NA |
16 | 7 | 16 | 9 | 6 | 0.73 | 1999-01-02 | 265.2 | -92.9244 | 42.0647 | 0 |
9 | 9 | 18 | 12 | 3 | 0.20 | 1999-01-03 | 265.2 | -92.9244 | 42.0647 | NA |
-12 | -12 | 9 | 8 | 0 | 0.00 | 1999-01-04 | 265.2 | -92.9244 | 42.0647 | NA |
-10 | -26 | -7 | 8 | 0 | 0.00 | 1999-01-05 | 265.2 | -92.9244 | 42.0647 | NA |
Initial Model
Based on the most complete variables in the data set, with less than 30% NA values, were used to builld the model’s training and test sets.
divider <- nrow(marshall_whole) * 0.7
train <- marshall_whole[1:divider, ]
test <- marshall_whole[divider + 1:nrow(marshall_whole), ]
Independent Variables
-Minimum Daily Temperature in 5/100ths of a degree - Maximum Daily Temperature in 5/100ths of a degree - Daily Aggregate of Precipitation (by weight)
fit <- lm(damage_crops ~ TMAX + TMIN + PRCP, data = train, na.action = na.omit)
require(xtable)
print(xtable(summary(fit)), type = "html", floating = TRUE)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -592118.3112 | 5731357.1640 | -0.10 | 0.9178 |
TMAX | 17373.9025 | 218495.9227 | 0.08 | 0.9367 |
TMIN | 40754.8016 | 233753.6291 | 0.17 | 0.8618 |
PRCP | -2291742.2985 | 3711319.5818 | -0.62 | 0.5378 |
Based on the summary statistics, it is pretty easy to see that none of the variables is significant.
The \(R^2\) values confirm the suspicion that this may not be a highly predictive model.
\(R^2\) : 0.0062327
Adjusted \(R^2\) : -0.0125177
The real test is in the model’s ability to predict a loss, and accurately quantify loss.
Predictions & Evaluation
Model Accuracy: 46.2 %
Estimate of Business Performance
Payouts <- test %>% filter(test$Correct == 1 & test$Prediction > 0 & !is.na(test$Correct)) %>%
mutate(Discrepancy = damage_crops - Prediction) %>% select(DATE, TMIN, TMAX,
PRCP, damage_crops, Prediction, Correct, Discrepancy)
Net Discrepancy: \(\$\) 167483616
Average Discrepancy: \(\$\) 3418033
Standard Deviation of Discrepancy: \(\$\) 35699571
From looking at events one-by-one on this table you can see clearly that we did not do a great job of predicting loss. So, where to next?
How Does Time Affect the Contribution of Variables to the Model?
Gasparini Distributed Lag Models
- Each predictor varible is passed into a function, which allows the consideration of a variety of short and long-term lags in the estimation of a global model
- Tempratures were give 5-day lag windows to account for localized heat or cold extremes
- Precipitation was given a 20-day lag window to allow for possible flood and draught conditions
# Build the lag into variables
cross.tmax <- crossbasis(train$TMAX, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata",
breaks = 1))
cross.tmin <- crossbasis(train$TMIN, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata",
breaks = 1))
cross.prcp <- crossbasis(train$PRCP, lag = 20, argvar = list(fun = "lin"), arglag = list(fun = "poly",
degree = 4))
# Construct regression using the lagged variables
new_fit <- lm(damage_crops ~ cross.prcp + cross.tmax + cross.tmin, train)
# Predict the new fit over increment of .1 using a cumulative lag paradigm
pre.tmax <- crosspred(cross.tmax, new_fit, at = 0:10, bylag = 0.1, cumul = TRUE)
# Plots of Graphs for Each Predictor incorporating Lag
mar.default <- c(5, 4, 4, 2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.tmax, "slices", var = 5, col = 3, xlim = c(0, 5), las = 1, ylim = c(-600000,
1000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Maximum Temperature")
par(mar = mar.default + c(0, 4, 0, 0))
pre.prcp <- crosspred(cross.tmin, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)
plot(pre.prcp, "slices", var = 5, col = 3, xlim = c(-0.1, 5.5), las = 1, ylim = c(-1900000,
1250000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Minimim Temperature")
pre.prcp <- crosspred(cross.prcp, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)
par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.prcp, "slices", var = 5, col = 2, xlim = c(-1, 22), las = 1, ylim = c(-14650000,
29000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Precipitation")
Although no final linear model is constructed from the lag-converted predictor variables, there is a clear sense that the contribution a variable has on the outcome varies with the inclusion of past days weather events, particularly in the case of precipitation. # Conclusions
- Even a model built upon limited data has some ability to predict whether or not a loss occured (currently at 46% accuracy)
- With more frequent observations, individual claims data and a well constructed lag-model the probability of predicting a loss would likely adequate to justify an automated claim
References
Search | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
Overview of noaastormevents. (n.d.). Retrieved December 11, 2017, from https://cran.r-project.org/web/packages/noaastormevents/vignettes/noaastormevents.html
Datasets | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND
Gasparrini, A., Scheipl, F., Armstrong, B., & Kenward, M. G. (2017). A penalized framework for distributed lag non-linear models. Biometrics, 73(3), 938–948. https://doi.org/10.1111/biom.12645
Gasparrini A. Distributed lag linear and non-linear models in R: the package dlnm. Journal of Statistical Software. 2011;43(8):1-20.
Gasparrini A, Armstrong B, Kenward MG. Distributed lag non-linear models. Statistics in Medicine. 2010;29(21):2224-2234.