Predicting Crop Loss Using Historical Weather Data

Experimental Crop Damage Model

This analysis looks at crop damages relative to weather conditions to try to build a tool for estimating loss associated with prolonged extreme weather events. The model itself will have a limited geographic scope, to make building a model and testing the assumptions easier during the defining stages of model design, and scaled with added counties, states and data sources as methods prove useful.

Framework for Analysis

Motivation & Intended Use

Compensations for crop losses are a huge expenditure for companies insuring farms across the United States. When there are widespread extreme weather events contributing to crop damage it can cost insurers millions of dollars in claims, but there is also the added cost associated with adjustment and assessment. Each claim, whether paid or not, must be evaluated to confirm both existence and magnitude of loss. Sending an adjust to process a relatively small loss, when there are many claims including large ones to be adjusted, can add significant cost to the processing of claims for a given event.

If the cost of adjusting a small claim is extreme relative to the payout, it might make more sense for an insurer to simply pay the claim provided the agent has adequate confidence that an event occurred capable of causing the loss claimed by the insured.

Currently most losses are human-adjusted, with a growing number of technological tools, such as drones and satellite imagery, used to improve the speed and results of the inspection process.

Having models which predict a reasonable claim payout based on historical weather and losses could be of use in this situation. If the model suggest a value that is within an reasonable range, based on proximity to the event, and historical losses incurred in other such events, an insurer could simply issue a check, instead of investing human capital to investigate.

This would allow insurers to focus their human capital on the most pressing claims, which save or cost them the most when erroneously adjusted.

Practical Performance Guidelines

Performance of a model like this would be best evaluated on how often it is correct in assigning value (accuracy of loss/no loss) and how close to predicting a correct payout amount for those right predictions (net difference) and whether it over or under pays. If the model were to under predict payouts, requiring manual reviews for contested findings, that would be better than over recognizing events and suggesting loss when there was none. This would make it more robust against fraud. If it were relatively close in payments, but reduced costs, it might be worth using as a first estimate of whether or not the claim filed was justified. The agents would have acreage, valuation and percent of loss in hand already, so if the model suggested a loss, and claim were lower than the loss suggested by the model for the area in question, it might be an adequate solution to clearing a backlog of smaller filings.

These types of models could be used in other areas of insurance, and similar models would be helpful in estimating disaster recovery costs for state & federal agencies working in relief after natural disasters. The models would be similar, although the outcome might be water, man-hours of service, food or medicines instead of claims dollars.

Assumptions

The utility of this model is predicated on the assumption that time, money and manpower delay the adjustment of crop claims, that larger claims get more attention and receive a higher priority. This means that smaller claims become an added burden upon insurers when faced with many larger more complex claims, and that the cost of adjusting smaller claims might be disproportionate with the cost of diverting manpower away from working on larger claims where erroneous estimates are more detrimental to the financial security of the insurer.

Choosing a Geography to Analyze

Iowa was chosen as the state to look in for a county to use as the test geography for four major reasons:

  1. Heavily agricultural state with vast areas of wheat and corn, which are prone to weather related damage
  2. Consistent topography with moderate elevation changes, which facilitates the building and testing of a small, scale model
  3. Crops with predictable and consistent yields (corn and wheat are easier to estimate than say apples or peaches)
  4. An ecosystem prone to extreme weather events from freezing cold, to droughts, to tornadoes and hail.

Within Iowa, Marshall county became the choice because it is a highly agricultural county with known historical crop-loss events.

# data for maps is included in package
states <- map_data("state")
iowa_geo <- subset(states, region == "iowa")

counties <- map_data("county")
iowa_county <- subset(counties, region == "iowa")

marshall <- subset(iowa_county, subregion == "marshall")

base <- ggplot(data = iowa_geo, mapping = aes(x = long, y = lat, group = group)) + 
    coord_fixed(1.3) + geom_polygon(data = iowa_county, fill = NA, color = "white") + 
    geom_polygon(color = "black", fill = NA) + geom_polygon(data = marshall, 
    fill = "#a65c85b2", color = "white")
base

iowa <- read.csv("https://www2.census.gov/geo/docs/reference/codes/files/st19_ia_cou.txt", 
    header = FALSE)
iowa <- iowa %>% rename(State = V1) %>% rename(State_Code = V2) %>% rename(County_Code = V3) %>% 
    rename(County = V4) %>% rename(Active_Code = V5) %>% mutate(County_Code = str_pad(County_Code, 
    3, pad = "0")) %>% mutate(State_Code = str_pad(State_Code, 3, pad = "0")) %>% 
    mutate(FIPS = paste0(State_Code, County_Code))

# Get FIPS Code
marshall_FIPS <- iowa %>% filter(County == "Marshall County") %>% pull(FIPS)

The US Census FIPS code for Marshall County, Iowa is : 019127

Data Sources & Acquisition

In order to do meaningful analysis of infrequent events, it was necessary to acquire as much weather and event data as possible, while at the same time maintaining the level of granularity in the events set.

To run the analysis two sets of data were queried 1. National Oceanographic Atmospheric Administration’s Storm Events Data - via R package ‘noaastormevents’ 2 National Oceanographic Atmospheric Administration’s Climate Data Online - in a CSV with data from 01-01-1999 to 12-31-2010, for Marshall County, Iowa

Reading Data from NOAA Storm Events Database

# find_events() is written to download all files and sort out what you
# request
all_events <- find_events(date_range = c("1999-01-01", "2010-12-31"))
begin_date end_date state cz_type cz_name event_type source injuries_direct injuries_indirect deaths_direct deaths_indirect damage_property damage_crops fips
1999-01-01 1999-01-01 Louisiana C Bossier Tornado NWS STORM SURVEY 0 0 0 0 60000 0 22015
1999-01-01 1999-01-01 Louisiana C Webster Tornado NWS STORM SURVEY 0 0 0 0 88000 0 22119
1999-01-01 1999-01-01 Texas C Trinity Hail LAW ENFORCEMENT 0 0 0 0 5000 0 48455
1999-01-01 1999-01-01 Texas C Montgomery Tornado TRAINED SPOTTER 0 0 0 0 75000 0 48339
1999-01-01 1999-01-01 Texas C Liberty Tornado NWS STORM SURVEY 0 0 0 0 175000 0 48291
1999-01-01 1999-01-01 Texas C Walker Tornado LAW ENFORCEMENT 0 0 0 0 15000 0 48471

Massaging Storm Events data to be more useful, filtering for counties names Marshall in the state of Iowa, selecting only the beginning & end dates as well as crop damage to bind to the weather data frame.

marshall_events <- all_events %>% filter(state == "Iowa") %>% filter(cz_name == 
    "Marshall") %>% select(begin_date, end_date, damage_crops)
begin_date end_date damage_crops
1999-01-01 1999-01-02 0
1999-02-11 1999-02-11 0
1999-03-08 1999-03-08 0
1999-03-17 1999-03-17 0
1999-04-06 1999-04-15 0
1999-04-08 1999-04-08 0

Comparing Weather Driven Crop Loss Across Iowa

From year to year loss events differ in numbers and degree. The following maps are here to give a sense of how variable weather related crop loss can be in Iowa.

Iowa_events_01 <- all_events %>% filter(state == "Iowa") %>% mutate(DATE = as.Date(end_date, 
    format = "%m/%d/%y")) %>% filter(year(DATE) %in% c(2001))
Iowa_events_05 <- all_events %>% filter(state == "Iowa") %>% mutate(DATE = as.Date(end_date, 
    format = "%m/%d/%y")) %>% filter(year(DATE) %in% c(2005))

2001 Crop Damage & Number of Storm Events

map_events(Iowa_events_01, states = "Iowa", plot_type = "crop damage")

map_events(Iowa_events_01, states = "Iowa", plot_type = "number of events")

2005 Crop Damage & Number of Storm Events

map_events(Iowa_events_05, states = "Iowa", plot_type = "crop damage")

map_events(Iowa_events_05, states = "Iowa", plot_type = "number of events")

As you can see, there are many storm events with little to no reported losses, which is why we needed such a number of years to move forward with the analysis.

Reading in the Weather Data

The downloaded file was read in and then polished for merging with the Marshall Events data. In choosing data to use in the analysis, I calculated the number of NA values in a given column which could possibly be considered a predictor variable, keeping only the variables where at least 70 of the observations were not NA values.

The regression model will not know how to attribute multiple values to the same variable in the same location on the same day with a different outcome. So, loss data from the same day, was combined to create a single target day for the summarized weather data on each day.

marshall_all_weather <- read.csv("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/Data%20607%20Final/marshall_weather.csv", 
    stringsAsFactors = FALSE, header = TRUE)

# Build list to iterate over
var_names <- c("LATITUDE", "LONGITUDE", "ELEVATION", "DATE", "AWND", "DAPR", 
    "FMTM", "MDPR", "PGTM", "PRCP", "SNOW", "SNWD", "TAVG", "TMAX", "TMIN", 
    "TOBS", "TSUN", "WDF2", "WDF5", "WESD", "WESF", "WSF2", "WT01", "WT02", 
    "WT03", "WT04", "WT05", "WT06", "WT08", "WT09")
# Create Data Frame of NA ratios to use in excluding variables
nas <- data.frame(Variable = character(0), Quantity = numeric(0), Proportion = numeric(0))
for (i in var_names) {
    num = sum(is.na(marshall_all_weather[i]))
    prop = num/nrow(marshall_all_weather)
    temp = data.frame(i, num, prop)
    colnames(temp) = c("Variable", "Quantity", "Proportion")
    nas <- rbind(temp, nas)
}

# Build a list of variables to keep using logicals
keep <- nas %>% mutate(Variable = as.character(Variable)) %>% filter(Proportion < 
    0.7) %>% pull(Variable)
# Format Date correctly & reduce to only variables with sufficient data to
# feed into model
marshall_all_weather <- marshall_all_weather %>% mutate(DATE = as.Date(DATE, 
    format = "%m/%d/%y")) %>% 
select(keep, -WSF5)

# Combining Events Data
marshall_events <- marshall_events %>% mutate(damage_crops = round(damage_crops, 
    0)) %>% group_by(end_date) %>% summarise(damage_crops = sum(damage_crops))
TOBS TMIN TMAX SNWD SNOW PRCP DATE ELEVATION LONGITUDE LATITUDE
7 -10 8 3 0 0.00 1999-01-01 265.2 -92.9244 42.0647
16 7 16 9 6 0.73 1999-01-02 265.2 -92.9244 42.0647
9 9 18 12 3 0.20 1999-01-03 265.2 -92.9244 42.0647
-12 -12 9 8 0 0.00 1999-01-04 265.2 -92.9244 42.0647
-10 -26 -7 8 0 0.00 1999-01-05 265.2 -92.9244 42.0647
21 -10 21 8 0 0.02 1999-01-06 265.2 -92.9244 42.0647

Combining Data for Analysis

The two sets marshall_events and marshall_all_weather are joined together on the DATE field in the daily summary,marshall_all_weather , and the end_date in the marshall_events database. The new data frame is saved in the variable name marshall_whole.

This allows supervised methods to access predictors and outcome variables in a single data frame

marshall_whole <- left_join(marshall_all_weather, marshall_events, by = c(DATE = "end_date"))
marshall_whole <- marshall_whole %>% mutate(damage_crops = round(damage_crops, 
    0))

This is the resulting set:

TOBS TMIN TMAX SNWD SNOW PRCP DATE ELEVATION LONGITUDE LATITUDE damage_crops
7 -10 8 3 0 0.00 1999-01-01 265.2 -92.9244 42.0647 NA
16 7 16 9 6 0.73 1999-01-02 265.2 -92.9244 42.0647 0
9 9 18 12 3 0.20 1999-01-03 265.2 -92.9244 42.0647 NA
-12 -12 9 8 0 0.00 1999-01-04 265.2 -92.9244 42.0647 NA
-10 -26 -7 8 0 0.00 1999-01-05 265.2 -92.9244 42.0647 NA
21 -10 21 8 0 0.02 1999-01-06 265.2 -92.9244 42.0647 NA

Initial Model

For this analysis, despite the fact that each observation is discrete for a given day, because there are days with a full weather profile and a zero damage estimate, it is important to preserve some of the time-series nature of the data, to use in evaluating the effectiveness of this particular model.

Accordingly the training set was taken from the first 70% of observations, and the test set the final 30%. using simple methods.

divider <- nrow(marshall_whole) * 0.7
train <- marshall_whole[1:divider, ]
test <- marshall_whole[divider + 1:nrow(marshall_whole), ]

From the training set, a multiple linear regression is built with damage_crops as the outcome variable. If this prediction model works, the resulting output for a set of independent weather variables should be the amount of crop loss associated with that particular day. Because they are available for all outcome days, the model was built using Minimum Temperature, Maximum Temperature and Precipitation as the independent variables

fit <- lm(damage_crops ~ TMAX + TMIN + PRCP, data = train, na.action = na.omit)
require(xtable)
print(xtable(summary(fit)), type = "html", floating = TRUE)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -592118.3112 5731357.1640 -0.10 0.9178
TMAX 17373.9025 218495.9227 0.08 0.9367
TMIN 40754.8016 233753.6291 0.17 0.8618
PRCP -2291742.2985 3711319.5818 -0.62 0.5378
par(mfrow = c(2, 2))
plot(fit)

Based on the summary statistics, it is pretty easy to see that none of the variables listed and used in the current context contribute in a significant manner to this model. And a quick glance at the \(R^2\) values confirm the suspicion that this may not be a highly predictive model!

\(R^2\) : 0.0062327

Adjusted \(R^2\) : -0.0125177

Let’ see how the model works on the test set

Predictions & Evaluation

# simple prediction based on the training model
test$Prediction <- predict(fit, newdata = test)
# assigning 1 to observations predicted no loss with no loss & predicted
# loss with a loss, zero to all others.  This will become our accuracy
# check, the mean of ones
test <- test %>% mutate(Correct = ifelse((test$Prediction <= 0 & test$damage_crops <= 
    0) | (test$Prediction > 0 & test$damage_crops > 0), 1, 0)) %>% mutate(Prediction = round(Prediction, 
    0))
test_accuracy <- mean(test$Correct, na.rm = TRUE)

Model Accuracy: 46.2 %

Although this model was accurate in predicting a loss 46.2% of the time, we need to look closer to see how much each storm prediction was off, in terms of claim value.

Estimate of Business Performance

By looking at only the correct predictions, where the model suggested a payout (columntest$first is greater than zero and not NA) I created a data frame with a new column called discrepancy. For this new frame, if the damage_crop value was higher than the first (meaning first model) then the Discrepancy column would have a positive value, meaning the model underestimated loss. If the value were negative, the model overestimated loss (bad because it would be over paying).

damage_crops Prediction and Discrepancy are all shown in US dollars.

Payouts <- test %>% filter(test$Correct == 1 & test$Prediction > 0 & !is.na(test$Correct)) %>% 
    mutate(Discrepancy = damage_crops - Prediction) %>% select(DATE, TMIN, TMAX, 
    PRCP, damage_crops, Prediction, Correct, Discrepancy)

From looking at events one-by-one on this table you can see clearly that we did not do a great job of predicting loss. And looking at the total discrepancy, average and standard deviations of differences below, clearly this model has a long way to go before is is “operational”

Net Discrepancy: \(\$\) 167483616

Average Discrepancy: \(\$\) 3418033

Standard Deviation of Discrepancy: \(\$\) 35699571

From looking at events one-by-one on this table you can see clearly that we did not do a great job of predicting loss. So, where to next?

Possible Avenues to Explore

Because plants grow and die over time, absorbe water over time, generate sugar from light over time, it makes perfect sense that in most cases, the effect of a weather event on crops would evolve over time.

This initial model did not take into consideration how the days prior to an event, or even the number of days within an event (such as a slow-moving winter storm or a drought). To improve the quality of the model, it might be prudent to take into consideration how each of the variables affects a plant and incorporated lagged variables into the equations.

Gasparini Distributed Lag Models

Although we do not have granular enough observations and variables to build a more predictive model from this data, we can use Gasparrini’s distributed lag linear and non-linear methods to attribute the compounded effects of time on each of the variables to create plots of the relationship between a variable and lag periods.

# Build the lag into variables
cross.tmax <- crossbasis(train$TMAX, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata", 
    breaks = 1))
cross.tmin <- crossbasis(train$TMIN, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata", 
    breaks = 1))
cross.prcp <- crossbasis(train$PRCP, lag = 20, argvar = list(fun = "lin"), arglag = list(fun = "poly", 
    degree = 4))

# Construct regression using the lagged variables
new_fit <- lm(damage_crops ~ cross.prcp + cross.tmax + cross.tmin, train)


# Predict the new fit over increment of .1 using a cumulative lag paradigm
pre.tmax <- crosspred(cross.tmax, new_fit, at = 0:10, bylag = 0.1, cumul = TRUE)

# Plots of Graphs for Each Predictor incorporating Lag
mar.default <- c(5, 4, 4, 2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.tmax, "slices", var = 5, col = 3, xlim = c(0, 5), las = 1, ylim = c(-600000, 
    1000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Maximum Temperature")

par(mar = mar.default + c(0, 4, 0, 0))
pre.prcp <- crosspred(cross.tmin, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)
plot(pre.prcp, "slices", var = 5, col = 3, xlim = c(-0.1, 5.5), las = 1, ylim = c(-1900000, 
    1250000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Minimim Temperature")

pre.prcp <- crosspred(cross.prcp, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)

par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.prcp, "slices", var = 5, col = 2, xlim = c(-1, 22), las = 1, ylim = c(-126500000, 
    140000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Precipitation")

The top two Temperature Plots showing a 5-day lag window, clearly lack the granularity to be usefully predict, however, there is a noted increase in each at the one day mark. This would suggest that for periods of extreme temperature longer than a day, the odds of a loss claim increase. However we clearly do not have the right data to accurately model the effect.

Likewise the precipitation graph shows a clear association with three separate changes in direction (and effect) at different lag intervals between 0 and 20 days. On this plot, the confidence interval is also much tighter. This data might be sufficient with more granular temperature data and the addition of other possibly meaningful features such as wind direct, wind speed and sunlight measurements.

Conclusions

Although the linear regression derived of data from the NOAA Storm Events API and the NOAA Climate Database lack sufficient granularity to predict losses directly. The model offers some low confidence predictive power, which when combined with evidence from the lag variable aggregation in Gasparrin models, would suggest that having more granular environmental variables (greater number of environmental features as well as increased frequency and less daily aggregation) there is potential to improve on the model.

It is also my suspicion that with a more precise target, better claims data with numbers of acres, crop type, loss quantity and commodity values at the time of the claim, we could start to build a useful model, where we could predict whether or not the claim was likely valid and then estimate losses in real-time value based on current market rates.

The challenge in refining and implementing this modeling tool is accessing accurate, granular and geographically relevant data consistently to use in an operationalized model and combining it with sufficient claims data to build in accuracy as well as tying including geographic scope.

It is also not clear if the dollar amounts are equillibrated to the current market value. This would be crucial because it is important to use today’s value of the produce as a target to be sure that we are training the model to predict in the current marketplace.

References

Search | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND

Overview of noaastormevents. (n.d.). Retrieved December 11, 2017, from https://cran.r-project.org/web/packages/noaastormevents/vignettes/noaastormevents.html

Datasets | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND

Gasparrini, A., Scheipl, F., Armstrong, B., & Kenward, M. G. (2017). A penalized framework for distributed lag non-linear models. Biometrics, 73(3), 938–948. https://doi.org/10.1111/biom.12645

Gasparrini A. Distributed lag linear and non-linear models in R: the package dlnm. Journal of Statistical Software. 2011;43(8):1-20.

Gasparrini A, Armstrong B, Kenward MG. Distributed lag non-linear models. Statistics in Medicine. 2010;29(21):2224-2234.

2017-12-11