Predicting Marshall County Crop Loss Using Historical Weather Data

Experimental Crop Damage Model
Data Sources & Acquisition
Combining Data for Analysis
Initial Model
- Predictions & Evaluation
How Does Time Affect the Contribution of Variables to the Model?
- Gasparini Distributed Lag Models
References

Experimental Crop Damage Model

Motivation for Analysis

Insurance Companies Spend Millions of Dollars Adjusting Crop Loss Each Year
In Extreme Weather Events, Smaller Claims Take a Back Seat to Larger More Pressing Claims
Cost of Adjusting Small Claims Might be Greater Than Settling Quickly with Acurate Estimate
A Predictive Model, Geographically Identified, With Reasonable Predictive Power Great Asset

Practical Performance Guidelines

Accuracy in Identifying a True Loss
Reasonably Accurate Loss Estimate
Realtime Geolocated Predictions

Assumptions

The utility of this model is predicated on the assumption that time, money and manpower delay the adjustment of crop claims, that larger claims get more attention and receive a higher priority than smaller claims, leaving them left unsettled by insurers for too long.

Choosing a Geography to Analyze

Iowa was chosen as the state to look in for a county to use as the test geography for four major reasons:

Heavily agricultural state with vast areas of wheat and corn, which are prone to weather related damage
Consistent topography with moderate elevation changes, which facilitates the building and testing of a small, scale model
Crops with predictable and consistent yields (corn and wheat are easier to estimate than say apples or peaches)
An ecosystem prone to extreme weather events from freezing cold, to droughts, to tornadoes and hail.

Within Iowa, Marshall county became the choice because it is a highly agricultural county with known historical crop-loss events that vary from year to year and storm to storm.

# data for maps is included in package
states <- map_data("state")
iowa_geo <- subset(states, region == "iowa")

counties <- map_data("county")
iowa_county <- subset(counties, region == "iowa")

marshall <- subset(iowa_county, subregion == "marshall")

base <- ggplot(data = iowa_geo, mapping = aes(x = long, y = lat, group = group)) + 
    coord_fixed(1.3) + geom_polygon(data = iowa_county, fill = NA, color = "white") + 
    geom_polygon(color = "black", fill = NA) + geom_polygon(data = marshall, 
    fill = "#a65c85b2", color = "white")
base

Marshall County

Data Sources & Acquisition

In order to do meaningful analysis of infrequent events, it was necessary to acquire as much weather and event data as possible, while at the same time maintaining the level of granularity in both sets.

National Oceanographic Atmospheric Administration’s Storm Events Data - via R package ‘noaastormevents’(directly acceses data)
National Oceanographic Atmospheric Administration’s Climate Data Online - in a CSV with data from 01-01-1999 to 12-31-2010, for Marshall County, Iowa

Reading Data from NOAA Storm Events Database

The data comes in with 15 fields:

begin_date	end_date	state	cz_type	cz_name	event_type	source	damage_property	fips
1999-01-01	1999-01-01	Louisiana	C	Bossier	Tornado	NWS STORM SURVEY	60000	22015
1999-01-01	1999-01-01	Louisiana	C	Webster	Tornado	NWS STORM SURVEY	88000	22119
1999-01-01	1999-01-01	Texas	C	Trinity	Hail	LAW ENFORCEMENT	5000	48455
1999-01-01	1999-01-01	Texas	C	Montgomery	Tornado	TRAINED SPOTTER	75000	48339
1999-01-01	1999-01-01	Texas	C	Liberty	Tornado	NWS STORM SURVEY	175000	48291
1999-01-01	1999-01-01	Texas	C	Walker	Tornado	LAW ENFORCEMENT	15000	48471

Comparing Weather Driven Crop Loss Across Iowa

From year to year loss events differ in numbers and degree. The following maps are here to give a sense of how variable weather related crop loss can be in Iowa.

2001 Crop Damage & Number of Storm Events

map_events(Iowa_events_01, states = "Iowa", plot_type = "crop damage")

map_events(Iowa_events_01, states = "Iowa", plot_type = "number of events")

2005 Crop Damage & Number of Storm Events

map_events(Iowa_events_05, states = "Iowa", plot_type = "crop damage")

map_events(Iowa_events_05, states = "Iowa", plot_type = "number of events")

Marshall County Events Data

Because this is our target data for the model, all that was kept were the starting and ending event dates, as well as the crop damages for each of the storms reported for Marshall County, Iowa.

begin_date	end_date	damage_crops
1999-01-01	1999-01-02	0
1999-02-11	1999-02-11	0
1999-03-08	1999-03-08	0
1999-03-17	1999-03-17	0
1999-04-06	1999-04-15	0
1999-04-08	1999-04-08	0
1999-04-08	1999-04-08	0
1999-04-22	1999-04-27	0
1999-05-10	1999-05-10	0
1999-05-16	1999-05-16	0
1999-05-16	1999-05-20	25000
1999-05-21	1999-05-25	10000
1999-06-04	1999-06-04	5000
1999-06-08	1999-06-08	3000
1999-06-09	1999-06-15	75000

Reading in the Weather Data

There were multiple sources of daily weather data within NOAA, my chosen one was easily accessible at the county level using the R package countyweather.

However, at the time of the analysis, the package was experiencing technical difficulties accessing granular data to convert to a county level 15-minute intervals for a lagged linear model.

Instead daily averages were acquired in csv format, using the Climate Data Search Tool.

Weather Data Frame

TOBS	TMIN	TMAX	SNWD	SNOW	PRCP	DATE	ELEVATION	LONGITUDE	LATITUDE
7	-10	8	3	0	0.00	1999-01-01	265.2	-92.9244	42.0647
16	7	16	9	6	0.73	1999-01-02	265.2	-92.9244	42.0647
9	9	18	12	3	0.20	1999-01-03	265.2	-92.9244	42.0647
-12	-12	9	8	0	0.00	1999-01-04	265.2	-92.9244	42.0647
-10	-26	-7	8	0	0.00	1999-01-05	265.2	-92.9244	42.0647
21	-10	21	8	0	0.02	1999-01-06	265.2	-92.9244	42.0647

Summary Statistics of Weather Variables

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TOBS	1	4378	41.9641389	20.3754533	44	43.0724886	23.7216	-29	81.00	110.00	-0.4111246	-0.5193553	0.3079423
TMIN	2	8726	37.3964016	20.9207917	38	38.5012890	23.7216	-34	78.00	112.00	-0.3830204	-0.6182564	0.2239600
TMAX	3	8725	58.6043553	22.3965865	62	59.9469990	26.6868	-10	99.00	109.00	-0.4305054	-0.8745963	0.2397724
SNWD	4	8262	0.4852336	1.8209297	0	0.0045386	0.0000	0	26.00	26.00	5.0385890	31.7862020	0.0200332
SNOW	5	9776	0.0721358	0.5007687	0	0.0000000	0.0000	0	13.00	13.00	10.2348102	139.0760720	0.0050647
PRCP	6	14450	0.0975612	0.3097190	0	0.0204663	0.0000	0	4.83	4.83	5.7573139	47.1445766	0.0025765

Combining Data for Analysis

Sets are combined to facilitate a linear regression model intended to predict the dollar value of loss associated with each storm event.

This allows supervised methods to access predictors and outcome variables in a single data frame

marshall_whole <- left_join(marshall_all_weather, marshall_events, by = c(DATE = "end_date"))
marshall_whole <- marshall_whole %>% mutate(damage_crops = round(damage_crops, 
    0))

This is the resulting set:

TOBS	TMIN	TMAX	SNWD	SNOW	PRCP	DATE	ELEVATION	LONGITUDE	LATITUDE	damage_crops
7	-10	8	3	0	0.00	1999-01-01	265.2	-92.9244	42.0647	NA
16	7	16	9	6	0.73	1999-01-02	265.2	-92.9244	42.0647	0
9	9	18	12	3	0.20	1999-01-03	265.2	-92.9244	42.0647	NA
-12	-12	9	8	0	0.00	1999-01-04	265.2	-92.9244	42.0647	NA
-10	-26	-7	8	0	0.00	1999-01-05	265.2	-92.9244	42.0647	NA

Initial Model

Based on the most complete variables in the data set, with less than 30% NA values, were used to builld the model’s training and test sets.

divider <- nrow(marshall_whole) * 0.7
train <- marshall_whole[1:divider, ]
test <- marshall_whole[divider + 1:nrow(marshall_whole), ]

Independent Variables

-Minimum Daily Temperature in 5/100ths of a degree - Maximum Daily Temperature in 5/100ths of a degree - Daily Aggregate of Precipitation (by weight)

fit <- lm(damage_crops ~ TMAX + TMIN + PRCP, data = train, na.action = na.omit)
require(xtable)
print(xtable(summary(fit)), type = "html", floating = TRUE)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-592118.3112	5731357.1640	-0.10	0.9178
TMAX	17373.9025	218495.9227	0.08	0.9367
TMIN	40754.8016	233753.6291	0.17	0.8618
PRCP	-2291742.2985	3711319.5818	-0.62	0.5378

Based on the summary statistics, it is pretty easy to see that none of the variables is significant.

The $R^2$ values confirm the suspicion that this may not be a highly predictive model.

$R^2$ : 0.0062327

Adjusted $R^2$ : -0.0125177

The real test is in the model’s ability to predict a loss, and accurately quantify loss.

Predictions & Evaluation

Model Accuracy: 46.2 %

Estimate of Business Performance

Payouts <- test %>% filter(test$Correct == 1 & test$Prediction > 0 & !is.na(test$Correct)) %>% 
    mutate(Discrepancy = damage_crops - Prediction) %>% select(DATE, TMIN, TMAX, 
    PRCP, damage_crops, Prediction, Correct, Discrepancy)

Net Discrepancy: $\$$ 167483616

Average Discrepancy: $\$$ 3418033

Standard Deviation of Discrepancy: $\$$ 35699571

From looking at events one-by-one on this table you can see clearly that we did not do a great job of predicting loss. So, where to next?

How Does Time Affect the Contribution of Variables to the Model?

Gasparini Distributed Lag Models

Each predictor varible is passed into a function, which allows the consideration of a variety of short and long-term lags in the estimation of a global model
Tempratures were give 5-day lag windows to account for localized heat or cold extremes
Precipitation was given a 20-day lag window to allow for possible flood and draught conditions

# Build the lag into variables
cross.tmax <- crossbasis(train$TMAX, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata", 
    breaks = 1))
cross.tmin <- crossbasis(train$TMIN, lag = 5, argvar = list(df = 5), arglag = list(fun = "strata", 
    breaks = 1))
cross.prcp <- crossbasis(train$PRCP, lag = 20, argvar = list(fun = "lin"), arglag = list(fun = "poly", 
    degree = 4))

# Construct regression using the lagged variables
new_fit <- lm(damage_crops ~ cross.prcp + cross.tmax + cross.tmin, train)


# Predict the new fit over increment of .1 using a cumulative lag paradigm
pre.tmax <- crosspred(cross.tmax, new_fit, at = 0:10, bylag = 0.1, cumul = TRUE)

# Plots of Graphs for Each Predictor incorporating Lag
mar.default <- c(5, 4, 4, 2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.tmax, "slices", var = 5, col = 3, xlim = c(0, 5), las = 1, ylim = c(-600000, 
    1000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Maximum Temperature")

par(mar = mar.default + c(0, 4, 0, 0))
pre.prcp <- crosspred(cross.tmin, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)
plot(pre.prcp, "slices", var = 5, col = 3, xlim = c(-0.1, 5.5), las = 1, ylim = c(-1900000, 
    1250000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Minimim Temperature")

pre.prcp <- crosspred(cross.prcp, new_fit, at = 0:20, bylag = 0.2, cumul = TRUE)

par(mar = mar.default + c(0, 4, 0, 0))
plot(pre.prcp, "slices", var = 5, col = 2, xlim = c(-1, 22), las = 1, ylim = c(-14650000, 
    29000000), ylab = "", ci.arg = list(density = 25, lwd = 2), main = "Lag Association of Daily Precipitation")

Although no final linear model is constructed from the lag-converted predictor variables, there is a clear sense that the contribution a variable has on the outcome varies with the inclusion of past days weather events, particularly in the case of precipitation. # Conclusions

Even a model built upon limited data has some ability to predict whether or not a loss occured (currently at 46% accuracy)
With more frequent observations, individual claims data and a well constructed lag-model the probability of predicting a loss would likely adequate to justify an automated claim

References

Search | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND

Overview of noaastormevents. (n.d.). Retrieved December 11, 2017, from https://cran.r-project.org/web/packages/noaastormevents/vignettes/noaastormevents.html

Datasets | Climate Data Online (CDO) | National Climatic Data Center (NCDC). (n.d.). Retrieved December 11, 2017, from https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND

Gasparrini, A., Scheipl, F., Armstrong, B., & Kenward, M. G. (2017). A penalized framework for distributed lag non-linear models. Biometrics, 73(3), 938–948. https://doi.org/10.1111/biom.12645

Gasparrini A. Distributed lag linear and non-linear models in R: the package dlnm. Journal of Statistical Software. 2011;43(8):1-20.

Gasparrini A, Armstrong B, Kenward MG. Distributed lag non-linear models. Statistics in Medicine. 2010;29(21):2224-2234.

Predicting Marshall County Crop Loss Using Historical Weather Data

Bethany Poulin

Predicting Marshall County Crop Loss Using Historical Weather Data

Experimental Crop Damage Model

Motivation for Analysis

Practical Performance Guidelines

Assumptions

Choosing a Geography to Analyze

Marshall County

Data Sources & Acquisition

Reading Data from NOAA Storm Events Database

Comparing Weather Driven Crop Loss Across Iowa

2001 Crop Damage & Number of Storm Events

2005 Crop Damage & Number of Storm Events

Marshall County Events Data

Reading in the Weather Data

Weather Data Frame

Summary Statistics of Weather Variables

Combining Data for Analysis

Initial Model

Independent Variables

Predictions & Evaluation

Estimate of Business Performance

How Does Time Affect the Contribution of Variables to the Model?

Gasparini Distributed Lag Models

References