ESP 106 Final Project: Walnuts and Pesticides in California

Introduction

Walnuts are important agricultural commodity in California. About 99% of walnuts in the United States are grown in California, and walnuts are consistently one of California’s top ten valued agricultural commodities. As a high-value crop, growers must invest in crop protection measures, such as pesticides. In the 2017 and 2018 California Department of Pesticide Regulation (CDPR) Pesticide Use Annual Reports, walnut is listed as a key commodity of interest based on the pounds of active ingredients and acreage treated.

Understanding pesticide use is crucial for maintaining productive crops and limiting adverse effects, such as pollution, health effects, and pesticide resistance. CDPR publishes reports and summaries of pesticide use that focus on year to year changes and summary statistics.

In this project, I do some analyses similar to the CDPR annual reports, but usually in a more general scope. CDPR chooses key pesticide chemicals to investigate for each commodity, whereas I use broader pesticide categories. I also focus on only one year’s worth of pesticide applications and visualize usage over time. Also, the paid CDPR data analyst has the patience to make a township-range level map, but here I settle for counties. A step I take beyond what is in CDPR’s annual reports is to construct models of pesticide use based on the size of a region’s walnut industry.

Data Sources

USDA NASS Quick Stats: Where are the walnuts?

I accessed the USDA 2017 Census of Agriculture data through the NASS Quick Stats API using the rnassqs package. I used the following query parameters to generate my initial data frame.

param <- list(
  state_name = "CALIFORNIA",
  commodity_desc = "WALNUTS",
  statisticcat_desc = c("AREA BEARING", "AREA NON-BEARING"),
  source_desc = "CENSUS",
  agg_level_desc = "COUNTY",
  domain_desc = "TOTAL",
  year = 2017
)
acres <- nassqs(param)

The data frame acres has contains the acreage and the number of farm operations with walnuts for each county. I further divided the data using group_split(acres, short_desc), to separate the data into four categories: operations with acres bearing, operations with acres non-bearing, acres bearing, and acres non-bearing. “Bearing” and “non-bearing” acres signifies whether the walnut stand is bearing nuts or not bearing nuts. The analysis continues using the operation data frames made from the four tibbles, since the acreage data contained mostly NA values. When reading in these data frames from .csv files, the leading zero in the COUNTYFP column is dropped and must be replaced.

ops_bear <-  read.csv("data/clean/walnut_operations_bearing.csv")
ops_non <-  read.csv("data/clean/walnut_operations_nonbearing.csv")

ops_bear$COUNTYFP <- formatC(ops_bear$COUNTYFP, 
                             width = 3, 
                             format = "d", 
                             flag = "0")
ops_non$COUNTYFP <- formatC(ops_non$COUNTYFP, 
                            width = 3, 
                            format = "d", 
                            flag = "0")

CalPIP Pesticide Use Reporting Data

I used the Pesticide Use Report (PUR) data from the California Pesticide Information Portal. I used the query menus to get pesticide use data for 2017 on walnuts. The resulting text file needed some cleaning.

pesticide <- read.delim("data/pesticide_walnut_raw.txt")
pesticide$COUNTY_NAME <- as.factor(pesticide$COUNTY_NAME)
pesticide$PRODUCT_NAME <- as.factor(pesticide$PRODUCT_NAME)
pesticide$CHEMICAL_NAME <- as.factor(pesticide$CHEMICAL_NAME)
pesticide$DATE <- as.Date(pesticide$DATE, format = "%d-%b-%y")
pesticide <- pesticide[, c(3:5,7:13)]

The final data frame contains over 250,000 entries showing every pesticide application on walnuts in 2017 by date, township-range, county, area treated, pesticide product name, active ingredient name, and the amounts applied.

pesticide <- read.csv("data/clean/pesticide_final.csv")

CDPR Pesticide Database

To categorize specific pesticide products into general pesticide categories, I used the CDPR “Search for Pesticide Products Using Multiple Variables” query, which allows users to search for pesticide products registered for use on specific commodities that fall under different pesticide categories. I used the site code “3009” corresponding to English walnuts and selected a pesticide category, and checked the box to select all results to make the final output table. I only included currently registered products, which might not exactly match the products registered in 2017. I copied each product, chemical, and site data report into a .csv file and applied a function to each file to condense the information to a list of pesticide products for each pesticide type.

only_cide_name <- function(file, path){
  x <- read.csv(paste0(path, "/", file))
  unique(x[,"Product.Name"])
}
path = "data/pesticide_names_raw"
cide_files <- list.files(path, pattern = ".csv")
pesticide_names_all <- sapply(cide_files, FUN = only_cide_name, path = path)
names(pesticide_names_all) <- sapply(X = cide_files, 
                                     sub, pattern = "_names.csv", replacement = "")

Pesticide Usage Counts and Category Matching

To match the pesticide products in the pesticide use data to pesticide categories, I created a table of the product names from the pesticide data, and added a column for each category. The column values were TRUE or FALSE to indicate the product was or was not listed in that category. Many products remained unclassified after the matching. For some products, there seems to be a mistake, but many of the uncategorized products are adjuvants and other additives that are designed to aid application, not target pests.

# create the table
pesticide_count <- table(pesticide$PRODUCT_NAME)
pesticide_count <- as.data.frame(pesticide_count)
colnames(pesticide_count) <- c("name", "count")
pesticide_count <- pesticide_count[order(pesticide_count$count, decreasing = TRUE),]

# make a list of the category files
path = "data/pesticide_names_final"
cide_files <- list.files(path, pattern = ".csv")
new_colnames <- sapply(X = cide_files, 
                       sub, pattern = ".csv", replacement = "")

# read each file and create the TRUE/FALSE columns
for(i in 1:length(cide_files)){
  x <- read.csv(paste0(path, "/", cide_files[i]))
  y <- as.data.frame(pesticide_count$name %in% x[,new_colnames[i]])
  names(y) <- new_colnames[i]
  pesticide_count <- cbind(pesticide_count, y)
}

I also added the total pounds applied for each product using the following loop.

lbs <- data.frame()
for (i in pesticide_count$name){
  x <- pesticide[pesticide$PRODUCT_NAME == i,]
  y <- data.frame(name = i, lbs = sum(x$POUNDS_PRODUCT_APPLIED, na.rm = TRUE))
  lbs <- rbind(lbs, y)
}
pesticide_count <- merge(lbs, pesticide_count)

I created a column for the pounds of active ingredient applied in the same way, and saved the final data frame as a .csv file.

pesticide_count <- read.csv("data/clean/pesticide_count.csv")

Map of California Counties

To plot maps, I used the CA County Boundaries data from the California Open Data Portal.

ca <- vect("data/ca_counties/CA_counties_TIGER2016.shp")

Data Exploration

Walnut Operations

To plot the number of walnut operations in each county, I merged the operation data with the county map vector data.

ca_op <- merge(ca, ops_bear, by = "COUNTYFP", all = TRUE)
ca_op_non <- merge(ca, ops_non, by = "COUNTYFP", all = TRUE)

Next, I chose a yellow-green gradient for my map colors.

colors <- brewer.pal(9, "YlGn")
col_ramp <- colorRampPalette(colors)

Finally, I plotted the maps. While the scales are different, the colors show the same percentiles in both plots.

plot(ca_op, "Value", 
     type = "continuous", 
     col = col_ramp(20),
     axes = FALSE,
     main = "Walnut Operations: Bearing")
lines(ca_op, col = "grey58", lwd = 1)

plot(ca_op_non, "Value", 
     type = "continuous", 
     col = col_ramp(20),
     axes = FALSE,
     main = "Walnut Operations: Non-bearing")
lines(ca_op, col = "grey58", lwd = 1)

The maps are very similar, and there are a few counties with many more operations than the rest of the state. The counties with the most operations with acres bearing are:

head(ops_bear[order(ops_bear$Value, decreasing = TRUE), "county_name"])

## [1] "SAN JOAQUIN" "STANISLAUS"  "BUTTE"       "SUTTER"      "TULARE"     
## [6] "LAKE"

The counties with most operations with non-bearing acres are:

head(ops_non[order(ops_non$Value, decreasing = TRUE), "county_name"])

## [1] "SAN JOAQUIN" "STANISLAUS"  "SUTTER"      "BUTTE"       "TULARE"     
## [6] "GLENN"

Plotting Pesticide Use

The new_colnames vector used briefly to create the pesticide_count data frame became very useful for looping through the pesticide categories. I wrote a function, plot_prep, to add up the variable of interest by the pesticide category.

plot_prep <- function(x){
  v <- data.frame()
  for(i in new_colnames){
  z <- pesticide_count[pesticide_count[,i] == TRUE, x]
  y <- data.frame(category = as.factor(i), 
                  value = sum(z))
  v <- rbind(v, y)
} 
v[order(v$value, decreasing = TRUE),]
}

Then, I used the function to generate the data for the following plots.

apps <- plot_prep("count")
par(mar = c(5,10,4,5))
barplot(apps$value,
        las = 1,
        horiz = TRUE,
        cex.names = 0.8,
        xlab = "Applications",
        main = "Number of Applications by Pesticide Category",
        names.arg = apps$category,
        axes = FALSE)
box(bty = "l")
axis(1, at = seq(0, 35000, 5000),
     cex.axis = 0.8)

pounds <- plot_prep("lbs")
par(mar = c(5,10,4,5))
barplot(pounds$value,
        las = 1,
        horiz = TRUE,
        cex.names = 0.8,
        xlab = "Amount Applied (millions of pounds)",
        main = "Pounds of Pesticide Applied by Category",
        names.arg = pounds$category,
        axes = FALSE)
box(bty = "l")
axis(1, at = seq(0, 5000000, 1000000), 
     labels = seq(0,5,1),
     cex.axis = 0.8)

pounds_chem <- plot_prep("lbs_chem")
par(mar = c(5,10,4,5))
barplot(pounds_chem$value,
        las = 1,
        horiz = TRUE,
        cex.names = 0.8,
        xlab = "Amount Applied (millions of pounds)",
        main = "Pounds of Active Ingredient Applied by Category",
        names.arg = pounds_chem$category,
        axes = FALSE)
box(bty = "l")
axis(1, at = seq(0, 4000000, 1000000), 
     labels = seq(0,4,1),
     cex.axis = 0.8)

From these plots, the pesticides most used on walnuts are generally fungicides, insecticides, bactericides, and herbicides. The rest of the analysis focuses on these major categories.

Data Analysis

Mapping Intensity of Pesticide Use

For these maps, I used a yellow-red color palette.

colors <- brewer.pal(9, "YlOrRd")
col_ramp <- colorRampPalette(colors)

Then, I made a function that would produce a county-level map of the pounds of active ingredient applied per acre for a given pesticide type. The calculation uses total cumulative acres treated, meaning that one acre treated twice is counted as two acres treated.

county_name_fps <- ops_bear[, c("COUNTYFP", "county_name")]
pesticide_cat_map <- function(x, title){
  y <- pesticide[pesticide$PRODUCT_NAME %in% pesticide_count[pesticide_count[,x] == TRUE, "name"] & pesticide$UNIT_TREATED == "A",] %>%
        group_by(COUNTY_NAME)%>%
        summarize(chemlbs = sum(as.numeric(POUNDS_CHEMICAL_APPLIED), na.rm = TRUE), treated = sum(AMOUNT_TREATED, na.rm = TRUE)) %>%
        mutate(chemperacre = chemlbs/treated)
  names(y) <- c("county_name", "chemlbs", "treated", "chemperacre")
  y <- merge(y, county_name_fps, by = "county_name", all = TRUE)
  ca_y <- merge(ca, y, by = "COUNTYFP", all = TRUE)
  plot(ca_y, "chemperacre",
     type = "continuous",
     col = col_ramp(20),
     axes = FALSE,
     main = title)
  lines(ca_op, col = "grey58", lwd = 1)
  ca_y
}

The function makes a plot and creates a SpatVector with attribute data that I use later on for regression.

par(mfrow = c(2,2), oma = c(0, 0, 2, 0))
fungi <- pesticide_cat_map("fungicide", "Fungicides")
bact <- pesticide_cat_map("bactericide", "Bactericides")
herb <- pesticide_cat_map("herbicide", "Herbicides")
insect <- pesticide_cat_map("insecticide", "Insecticides") 
mtext("Pounds of Active Ingredient per Acre", outer = TRUE)

With these maps, we can see that some counties with relatively few walnut operations have high pesticide use intensity by these metrics, especially for fungicides, herbicides, insecticides. On the other hand, bactericide use intensity is fairly high in some of those counties with many walnut operations.

The top fungicide users were:

top_fungi <- as.data.frame(fungi)
head(top_fungi[order(top_fungi$chemperacre, 
                     decreasing = TRUE), "county_name"])

## [1] "MADERA"      "KERN"        "SAN BENITO"  "PLACER"      "MERCED"     
## [6] "SAN JOAQUIN"

Bactericides:

top_bact <- as.data.frame(bact)
head(top_bact[order(top_bact$chemperacre, 
                     decreasing = TRUE), "county_name"])

## [1] "KERN"       "PLACER"     "SHASTA"     "TEHAMA"     "BUTTE"     
## [6] "STANISLAUS"

Herbicides:

top_herb <- as.data.frame(herb)
head(top_herb[order(top_herb$chemperacre, 
                     decreasing = TRUE), "county_name"])

## [1] "SANTA CLARA"  "VENTURA"      "MONTEREY"     "LAKE"         "CONTRA COSTA"
## [6] "YUBA"

And finally, insecticides:

top_insect <- as.data.frame(insect)
head(top_insect[order(top_insect$chemperacre, 
                     decreasing = TRUE), "county_name"])

## [1] "MENDOCINO"  "SAN BENITO" "MADERA"     "COLUSA"     "CALAVERAS" 
## [6] "FRESNO"

Regression of Pesticide Use against the Number of Walnut Operations: Interesting Cases

To quantify the relationship between the number of walnut operations and pesticide use, I made some linear regression models. Similar to the previous section, I wanted to highlight some counties that might have unusual pesticide use.

The regression of pounds active ingredient applied against the number of operations is not very interesting on its own, so I will only present the \(R^2\) values and influence plots showing the studentized residuals, Hat-values, and Cook’s distance (as dot size). The influencePlot function in the car package will produce this plot. With these plots, we can see the outlying values in pesticide use, outlying values in number of walnut operations, and the observations that have a large impact on the regression model. Since I had to repeat this plotting four times, I made a function.

chem_ops_outliers <- function(x, name){
  y <- merge(x, ops_bear, by = "COUNTYFP", all = TRUE) %>%
    as.data.frame()
  z <- lm(chemlbs ~ Value, data = y)
  influencePlot(z, main = name)
  summary(z)
}

Next, I used the data frames produced for the previous maps as the inputs to the function.

influence_fungi <- chem_ops_outliers(fungi, "Fungicides")

The circles corresponding to outliers and influencers are labeled with the observation index. I am most interested in counties with especially high or low pesticide use for a given number of walnut operations, which are cases with large studentized residuals. In the fungicide use regression, observation 44 is an influential case with higher pesticide use than predicted. Observation 52 has much lower fungicide use than expected. This regression model has an \(R^2\) value of 0.809.

influence_bact <- chem_ops_outliers(bact, "Bactericides")

The major outliers for bactericide use are cases 44 and 46, and the \(R^2\) of the regression model is 0.688

influence_herb <- chem_ops_outliers(herb, "Herbicides")

Cases 52 and 34 are outliers in herbicide use, and the \(R^2\) for this model is 0.838.

influence_insect <- chem_ops_outliers(insect, "Insecticides")

Finally, the outliers for insecticide use are cases 34, 41, and 44. In this regression, the \(R^2\) value is 0.764.

Many of the same observations are considered outliers or influential cases across the different pesticide categories.

as.data.frame(fungi)[c(34,41,44,46,47,52), "NAME"]

## [1] "Colusa"      "San Joaquin" "Butte"       "Tulare"      "Stanislaus" 
## [6] "Lake"

Most of these counties are those with many walnut operations. Colusa county does not have many walnut operations, but uses a higher amount of insecticides and herbicides than the linear regression model predicts. Butte county is highly influential for all four pesticide categories and has high fungicide and bactericide use, but low insecticide use.

Pesticide Applications Over a Year

Lastly, I wanted to show how pesticide use in 2017 changed over the year.

pound_dates <- pesticide %>%
  group_by(DATE) %>%
  summarize(pounds = sum(as.numeric(POUNDS_CHEMICAL_APPLIED), na.rm = TRUE))
pound_dates$DATE <- as_date(pound_dates$DATE)

I repeated this for the pesticide categories by turning the above code into a function.

pesticide_lbs <- function(x){
  z <- pesticide[pesticide$PRODUCT_NAME %in% 
              pesticide_count[pesticide_count[,x] == TRUE, "name"],] %>%
    group_by(DATE) %>%
    summarize(pounds = sum(as.numeric(POUNDS_CHEMICAL_APPLIED), na.rm = TRUE))
  z$DATE <- as_date(z$DATE)
  z
}

I also made a function that produces the plot.

pound_date_plot <- function(x, title){
  ggplot(x, aes(x = DATE, y = pounds)) +
    geom_line(lwd = 1, col = "grey58", alpha = 0.7) +
    ggtitle(title) +
    scale_x_date(date_labels = "%b") +
    scale_y_continuous(limits = c(0,200000)) +
    theme_bw() +
    theme(axis.title.x = element_blank(), axis.title.y = element_blank())
}

Then, I used the functions to make the plot objects.

fungi_lb_date <- pesticide_lbs("fungicide") %>%
  pound_date_plot("Fungicides")
bact_lb_date <- pesticide_lbs("bactericide") %>%
  pound_date_plot("Bactericides")
herb_lb_date <- pesticide_lbs("herbicide") %>%
  pound_date_plot("Herbicides")
insect_lb_date <- pesticide_lbs("insecticide") %>%
  pound_date_plot("Insecticides")

Finally, I ended up with these plots

all_pest <- pound_date_plot(pound_dates, "All Pesticides")
grid.arrange(all_pest,fungi_lb_date, bact_lb_date, herb_lb_date, insect_lb_date,
             layout_matrix = rbind(c(1, NA), c(2,3),c(4,5)),
             top = "Pesticide Use 2017",
             left = "Pounds Active Ingredients",
             bottom = "Date")

For fungicides, bactericides, and insecticides, there are obvious periods time in the spring and summer when pesticide application spikes, whereas herbicides seem to be used in low amounts across most of the year. The spring and summer peaks fall in the flowering and fruiting season, and the autumn peak for all pesticides are likely harvest or post-harvest treatments.

Conclusion

In this project, I tried a few methods to find counties with exceptional pesticide use on walnuts and to understand pesticide use in general. From an industry stand-point, this information is not very useful. Pest management plans are structured around specific pests and locations, not general categories and counties. However, as a student, it is interesting to look at the more general patterns that come up in plant pathology courses: the dominance of fungal pathogens, and the timing of pesticide applications to treat problems on the developing flowers and fruit.
Some interesting questions come out of the search for outliers. For instance, why does Lake county seem to use so little fungicide and herbicide? The regression models could explain the data fairly well, but there are probably factors such as climate and pest management choices that explain the large amounts of bactericides used in Butte county or insecticides in Colusa county. To improve what I have done here, I would attempt to use the meridian-township-range-section information, even though the meridian is replaced by the alphabetical rank of the county name, to make a more detailed map of pesticide applications. Additionally, although the acreage data from the USDA is mostly unavailable, the acreage could likely be inferred from the area treated values in the pesticide use data. This maneuver could be possible, since it would be rare for any walnut operation to have no pesticide applications. Finally, it would be interesting to start from a specific pest problem and investigate the use of pesticides that could address that pest.
I would change some things in the code, as well. There were some operations that I could have done very early on, like changing data types for columns, that I had to deal with later, but never fixed upstream. Also, I would use ggplot more in my loops and in general. I found out that a ggplot object contains the data used in the the plot as a tibble late in the project, and I wish I had known that earlier.

References

CalPIP Pesticide Use Reporting Data https://calpip.cdpr.ca.gov/main.cfm
California County Boundaries https://data.ca.gov/dataset/ca-geographic-boundaries/resource/b0007416-a325-4777-9295-368ea6b710e6
CDPR: Search for Pesticide Products Using Multiple Variables https://apps.cdpr.ca.gov/docs/label/m4.cfm
CDPR: Summary of Pesticide Use Report Data - 2018 https://www.cdpr.ca.gov/docs/pur/pur18rep/18sum.htm
CDPR: 2017 Pesticide Use Report Highlights https://www.cdpr.ca.gov/docs/pur/pur17rep/pur_highlights_2017.pdf
CDPR: 2018 Pesticide Use Report Highlights https://www.cdpr.ca.gov/docs/pur/pur18rep/pur_highlights_2018.pdf
UC Davis Fruit and Nut Research Information Center http://fruitandnuteducation.ucdavis.edu/fruitnutproduction/Walnut/#:~:text=In%20California%2C%20walnuts%20flower%20in,and%20female%20(pistillate)%20flowers
USDA NASS QuickStats https://quickstats.nass.usda.gov/