Predicting Marshall County Crop Loss Using Historical Weather Data

Experimental Crop Damage Model

Motivation for Analysis

  • Insurance Companies Spend Millions of Dollars Adjusting Crop Loss Each Year
  • In Extreme Weather Events, Smaller Claims Take a Back Seat to Larger More Pressing Claims
  • Cost of Adjusting Small Claims Might be Greater Than Settling Quickly with Acurate Estimate
  • A Predictive Model, Geographically Identified, With Reasonable Predictive Power Great Asset

Practical Performance Guidelines

  • Accuracy in Identifying a True Loss
  • Reasonably Accurate Loss Estimate
  • Realtime Geolocated Predictions

Assumptions

The utility of this model is predicated on the assumption that time, money and manpower delay the adjustment of crop claims, that larger claims get more attention and receive a higher priority than smaller claims, leaving them left unsettled by insurers for too long.

Choosing a Geography to Analyze

Iowa was chosen as the state to look in for a county to use as the test geography for four major reasons:

  1. Heavily agricultural state with vast areas of wheat and corn, which are prone to weather related damage
  2. Consistent topography with moderate elevation changes, which facilitates the building and testing of a small, scale model
  3. Crops with predictable and consistent yields (corn and wheat are easier to estimate than say apples or peaches)
  4. An ecosystem prone to extreme weather events from freezing cold, to droughts, to tornadoes and hail.

Within Iowa, Marshall county became the choice because it is a highly agricultural county with known historical crop-loss events that vary from year to year and storm to storm.

# data for maps is included in package
states <- map_data("state")
iowa_geo <- subset(states, region == "iowa")

counties <- map_data("county")
iowa_county <- subset(counties, region == "iowa")

marshall <- subset(iowa_county, subregion == "marshall")

base <- ggplot(data = iowa_geo, mapping = aes(x = long, y = lat, group = group)) + 
    coord_fixed(1.3) + geom_polygon(data = iowa_county, fill = NA, color = "white") + 
    geom_polygon(color = "black", fill = NA) + geom_polygon(data = marshall, 
    fill = "#a65c85b2", color = "white")
base

Marshall County

Data Sources & Acquisition

In order to do meaningful analysis of infrequent events, it was necessary to acquire as much weather and event data as possible, while at the same time maintaining the level of granularity in both sets.

  1. National Oceanographic Atmospheric Administration’s Storm Events Data - via R package ‘noaastormevents’(directly acceses data)

  2. National Oceanographic Atmospheric Administration’s Climate Data Online - in a CSV with data from 01-01-1999 to 12-31-2010, for Marshall County, Iowa

Reading Data from NOAA Storm Events Database

The data comes in with 15 fields:

begin_date end_date state cz_type cz_name event_type source injuries_direct injuries_indirect deaths_direct deaths_indirect damage_property damage_crops fips
1999-01-01 1999-01-01 Louisiana C Bossier Tornado NWS STORM SURVEY 0 0 0 0 60000 0 22015
1999-01-01 1999-01-01 Louisiana C Webster Tornado NWS STORM SURVEY 0 0 0 0 88000 0 22119
1999-01-01 1999-01-01 Texas C Trinity Hail LAW ENFORCEMENT 0 0 0 0 5000 0 48455
1999-01-01 1999-01-01 Texas C Montgomery Tornado TRAINED SPOTTER 0 0 0 0 75000 0 48339
1999-01-01 1999-01-01 Texas C Liberty Tornado NWS STORM SURVEY 0 0 0 0 175000 0 48291
1999-01-01 1999-01-01 Texas C Walker Tornado LAW ENFORCEMENT 0 0 0 0 15000 0 48471

Comparing Weather Driven Crop Loss Across Iowa

From year to year loss events differ in numbers and degree. The following maps are here to give a sense of how variable weather related crop loss can be in Iowa.

2001 Crop Damage & Number of Storm Events

map_events(Iowa_events_01, states = "Iowa", plot_type = "crop damage")

map_events(Iowa_events_01, states = "Iowa", plot_type = "number of events")

2005 Crop Damage & Number of Storm Events

map_events(Iowa_events_05, states = "Iowa", plot_type = "crop damage")

map_events(Iowa_events_05, states = "Iowa", plot_type = "number of events")

Marshall County Events Data

Because this is our target data for the model, all that was kept were the starting and ending event dates, as well as the crop damages for each of the storms reported for Marshall County, Iowa.

begin_date end_date damage_crops
1999-01-01 1999-01-02 0
1999-02-11 1999-02-11 0
1999-03-08 1999-03-08 0
1999-03-17 1999-03-17 0
1999-04-06 1999-04-15 0
1999-04-08 1999-04-08 0
1999-04-08 1999-04-08 0
1999-04-22 1999-04-27 0
1999-05-10 1999-05-10 0
1999-05-16 1999-05-16 0
1999-05-16 1999-05-20 25000
1999-05-21 1999-05-25 10000
1999-06-04 1999-06-04 5000
1999-06-08 1999-06-08 3000
1999-06-09 1999-06-15 75000

Reading in the Weather Data

There were multiple sources of daily weather data within NOAA, my chosen one was easily accessible at the county level using the R package countyweather.

However, at the time of the analysis, the package was experiencing technical difficulties accessing granular data to convert to a county level 15-minute intervals for a lagged linear model.

Instead daily averages were acquired in csv format, using the Climate Data Search Tool.

Weather Data Frame

TOBS TMIN TMAX SNWD SNOW PRCP DATE ELEVATION LONGITUDE LATITUDE
7 -10 8 3 0 0.00 1999-01-01 265.2 -92.9244 42.0647
16 7 16 9 6 0.73 1999-01-02 265.2 -92.9244 42.0647
9 9 18 12 3 0.20 1999-01-03 265.2 -92.9244 42.0647
-12 -12 9 8 0 0.00 1999-01-04 265.2 -92.9244 42.0647
-10 -26 -7 8 0 0.00 1999-01-05 265.2 -92.9244 42.0647
21 -10 21 8 0 0.02 1999-01-06 265.2 -92.9244 42.0647

Summary Statistics of Weather Variables

vars n mean sd median trimmed mad min max range skew kurtosis se
TOBS 1 4378 41.9641389 20.3754533 44 43.0724886 23.7216 -29 81.00 110.00 -0.4111246 -0.5193553 0.3079423
TMIN 2 8726 37.3964016 20.9207917 38 38.5012890 23.7216 -34 78.00 112.00 -0.3830204 -0.6182564 0.2239600
TMAX 3 8725 58.6043553 22.3965865 62 59.9469990 26.6868 -10 99.00 109.00 -0.4305054 -0.8745963 0.2397724
SNWD 4 8262 0.4852336 1.8209297 0 0.0045386 0.0000 0 26.00 26.00 5.0385890 31.7862020 0.0200332
SNOW 5 9776 0.0721358 0.5007687 0 0.0000000 0.0000 0 13.00 13.00 10.2348102 139.0760720 0.0050647
PRCP 6 14450 0.0975612 0.3097190 0 0.0204663 0.0000 0 4.83 4.83 5.7573139 47.1445766 0.0025765

Combining Data for Analysis

Sets are combined to facilitate a linear regression model intended to predict the dollar value of loss associated with each storm event.

This allows supervised methods to access predictors and outcome variables in a single data frame

marshall_whole <- left_join(marshall_all_weather, marshall_events, by = c(DATE = "end_date"))
marshall_whole <- marshall_whole %>% mutate(damage_crops = round(damage_crops, 
    0))

This is the resulting set:

TOBS TMIN TMAX SNWD SNOW PRCP DATE ELEVATION LONGITUDE LATITUDE damage_crops
7 -10 8 3 0 0.00 1999-01-01 265.2 -92.9244 42.0647 NA
16 7 16 9 6 0.73 1999-01-02 265.2 -92.9244 42.0647 0
9 9 18 12 3 0.20 1999-01-03 265.2 -92.9244 42.0647 NA
-12 -12 9 8 0 0.00 1999-01-04 265.2 -92.9244 42.0647 NA
-10 -26 -7 8 0 0.00 1999-01-05 265.2 -92.9244 42.0647 NA

Initial Model

Based on the most complete variables in the data set, with less than 30% NA values, were used to builld the model’s training and test sets.

divider <- nrow(marshall_whole) * 0.7
train <- marshall_whole[1:divider, ]
test <- marshall_whole[divider + 1:nrow(marshall_whole), ]

Independent Variables

-Minimum Daily Temperature in 5/100ths of a degree - Maximum Daily Temperature in 5/100ths of a degree - Daily Aggregate of Precipitation (by weight)

fit <- lm(damage_crops ~ TMAX + TMIN + PRCP, data = train, na.action = na.omit)
require(xtable)
print(xtable(summary(fit)), type = "html", floating = TRUE)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -592118.3112 5731357.1640 -0.10 0.9178
TMAX 17373.9025 218495.9227 0.08 0.9367
TMIN 40754.8016 233753.6291 0.17 0.8618
PRCP -2291742.2985 3711319.5818 -0.62 0.5378

Based on the summary statistics, it is pretty easy to see that none of the variables is significant.

The \(R^2\) values confirm the suspicion that this may not be a highly predictive model.

\(R^2\) : 0.0062327

Adjusted \(R^2\) : -0.0125177

The real test is in the model’s ability to predict a loss, and accurately quantify loss.

Predictions & Evaluation

Model Accuracy: 46.2 %

Estimate of Business Performance

Payouts <- test %>% filter(test$Correct == 1 & test$Prediction > 0 & !is.na(test$Correct)) %>% 
    mutate(Discrepancy = damage_crops - Prediction) %>% select(DATE, TMIN, TMAX, 
    PRCP, damage_crops, Prediction, Correct, Discrepancy)

Net Discrepancy: \(\$\) 167483616

Average Discrepancy: \(\$\) 3418033

Standard Deviation of Discrepancy: \(\$\) 35699571

From looking at events one-by-one on this table you can see clearly that we did not do a great job of predicting loss. So, where to next?

How Does Time Affect the Contribution of Variables to the Model?

Gasparini Distributed Lag Models

  • Each predictor varible is passed into a function, which allows the consideration of a variety of short and long-term lags in the estimation of a global model
  • Tempratures were give 5-day lag windows to account for localized heat or cold extremes
  • Precipitation was given a 20-day lag window to allow for possible flood and draught conditions
# Build the lag into variables
cross.tmax <- crossbasis(train$TMAX, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata", 
    breaks = 1))
cross.tmin <- crossbasis(train$TMIN, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata", 
    breaks = 1))
cross.prcp <- crossbasis(train$PRCP, lag = 20, argvar = list(fun = "lin"), arglag = list(fun = "poly", 
    degree = 4))

# Construct regression using the lagged variables
new_fit <- lm(damage_crops ~ cross.prcp + cross.tmax + cross.tmin, train)


# Predict the new fit over increment of .1 using a cumulative lag paradigm
pre.tmax <- crosspred(cross.tmax, new_fit, at = 0:10, bylag = 0.1, cumul = TRUE)

# Plots of Graphs for Each Predictor incorporating Lag
mar.default <- c(5, 4, 4, 2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.tmax, "slices", var = 5, col = 3, xlim = c(0, 5), las = 1, ylim = c(-600000, 
    1000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Maximum Temperature")

par(mar = mar.default + c(0, 4, 0, 0))
pre.prcp <- crosspred(cross.tmin, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)
plot(pre.prcp, "slices", var = 5, col = 3, xlim = c(-0.1, 5.5), las = 1, ylim = c(-1900000, 
    1250000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Minimim Temperature")

pre.prcp <- crosspred(cross.prcp, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)

par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.prcp, "slices", var = 5, col = 2, xlim = c(-1, 22), las = 1, ylim = c(-14650000, 
    29000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Precipitation")

Although no final linear model is constructed from the lag-converted predictor variables, there is a clear sense that the contribution a variable has on the outcome varies with the inclusion of past days weather events, particularly in the case of precipitation. # Conclusions

  • Even a model built upon limited data has some ability to predict whether or not a loss occured (currently at 46% accuracy)
  • With more frequent observations, individual claims data and a well constructed lag-model the probability of predicting a loss would likely adequate to justify an automated claim

References

Search | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND

Overview of noaastormevents. (n.d.). Retrieved December 11, 2017, from https://cran.r-project.org/web/packages/noaastormevents/vignettes/noaastormevents.html

Datasets | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND

Gasparrini, A., Scheipl, F., Armstrong, B., & Kenward, M. G. (2017). A penalized framework for distributed lag non-linear models. Biometrics, 73(3), 938–948. https://doi.org/10.1111/biom.12645

Gasparrini A. Distributed lag linear and non-linear models in R: the package dlnm. Journal of Statistical Software. 2011;43(8):1-20.

Gasparrini A, Armstrong B, Kenward MG. Distributed lag non-linear models. Statistics in Medicine. 2010;29(21):2224-2234.

2018-01-26