Cleaning SSURGO data for mapping ecological processes

3/16/2020

Problem: Mapping horizon level soil characteristics to mapunits

Soil characteristics that are important for modelling ecosystem responses are found in the horizon level data
- Soil horizons are the vertical profile
Viewing these processes spatially requires connecting horizon level data to soil components and then to mapunits
- Each mapunit may be composed of 1 to several map components
- Each soil component consists of horizon layers
- For our models, we are interested in:
  - Soil characteristics in the first 30 cm
  - Total soil depth of each component

Before you start:

Determine the area symbol for your area of interest

To do this, locate the data online at https://websoilsurvey.sc.egov.usda.gov/App/WebSoilSurvey.aspx.
- Click the “Download Soils Data” tab
- Use the dropdown menu to populate the desired State and County option
- Record the Area Symbol
- Example: Crawford County, WI = WI023
- Example: Vernon County, WI = WI123

Before you start:

Download metadata about the columns

Information regarding each of the columns is found at:

https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053631,

Download the pdf:
- “SSURGO Metadata - Table Columns Descriptions Report”.

Create working environment in R

Built with R version 3.6.1

#load libraries
library(FedData)
library(dplyr)
#create and save functions
not_all_na <- function(x) any(!is.na(x)) #for data cleaning

A note about the FedData package

The FedData package has a function “get_ssurgo” that downloads the SSurgo data from the internet, so it is necessary to be connected to the internet.

The “get_ssurgo” function creates new directories for storing data within your working directory and will only need to be run once, as long as you continue to navigate to your working directory.

SSurgo data is stored in spatial and tabular form. We will use the tabular form.

Download SSURGO data

This is where you enter the recorded area symbol for your area of interest. Use it to fill in the template = c(“Area Symbol”). ‘label’ is the name of the folder that will be created in your working directory.

SSURGO.areas.023 <- get_ssurgo(template=c('WI023'), 
                               label='WI_023') #Crawford County, WI
SSURGO.areas.123 <- get_ssurgo(template=c('WI123'), 
                               label='WI_123') #Vernon County, WI

Name and store data frames from downloaded data

Create a chorizon, component, mapunit dataset from each of our areas of interest

crawford_chorizon <- SSURGO.areas.023$tabular$chorizon
crawford_component <- SSURGO.areas.023$tabular$component
crawford_mapunit <- SSURGO.areas.023$tabular$mapunit

vernon_chorizon <- SSURGO.areas.123$tabular$chorizon
vernon_component <- SSURGO.areas.123$tabular$component
vernon_mapunit <- SSURGO.areas.123$tabular$mapunit

Join Crawford and Vernon respective datasets

First, create a county variable for mapunit so that we can place data within a county in the final data set.

chorizon <- rbind(crawford_chorizon, vernon_chorizon)
component <- rbind(crawford_component, vernon_component)

crawford_mapunit$county <- "crawford"
vernon_mapunit$county <- "vernon"
mapunit <- rbind(crawford_mapunit, vernon_mapunit)

View sample data: chorizon

chorizon[1:9,1:5]

## # A tibble: 9 x 5
##   hzname desgndisc desgnmaster desgnmasterprime desgnvert
##   <chr>      <dbl> <chr>       <chr>                <dbl>
## 1 BC            NA BC          <NA>                    NA
## 2 2C             2 C           <NA>                    NA
## 3 Ap            NA A           <NA>                    NA
## 4 BE            NA BE          <NA>                    NA
## 5 Bt            NA B           <NA>                    NA
## 6 2Cr            2 C           <NA>                    NA
## 7 Ap            NA A           <NA>                    NA
## 8 2Bt            2 B           <NA>                    NA
## 9 Bt            NA B           <NA>                    NA

View sample data: component

component[1:9,1:5]

## # A tibble: 9 x 5
##   comppct.l comppct.r comppct.h compname   compkind          
##       <dbl>     <dbl>     <dbl> <chr>      <chr>             
## 1        NA       100        NA Seaton     Series            
## 2        NA       100        NA Seaton     Series            
## 3        NA       100        NA Seaton     Series            
## 4         1         3         5 Plainfield Series            
## 5        90        95       100 Chelsea    Series            
## 6         1         2         5 Forkhorn   Series            
## 7        NA       100        NA Pits       Miscellaneous area
## 8         1         3        10 Balmoral   Series            
## 9         1         2         5 Muscoda    Series

View sample data: mapunit

mapunit[1:9,1:5]

## # A tibble: 9 x 5
##   musym muname                                        mukind    mustatus muacres
##   <chr> <chr>                                         <chr>     <lgl>      <dbl>
## 1 112B2 Seaton silt loam, 2 to 6 percent slopes, vfs… Consocia… NA           210
## 2 112C2 Seaton silt loam, 6 to 12 percent slopes, vf… Consocia… NA           488
## 3 112D2 Seaton silt loam, 12 to 20 percent slopes, v… Consocia… NA           630
## 4 502D  Chelsea fine sand, 15 to 30 percent slopes    Consocia… NA            62
## 5 2022  Pits, siliceous sand                          Consocia… NA            14
## 6 435D2 Nuxmaruhanixete silt loam, 12 to 20 percent … Consocia… NA           118
## 7 1130F Lacrescent-Dunbarton complex, very stony, 30… Complex   NA          5908
## 8 115C2 Seaton silt loam, driftless ridge, 6 to 12 p… Consocia… NA         23573
## 9 224B  Elevasil sandy loam, 2 to 6 percent slopes    Consocia… NA            13

Data observations

Note that after cleaning these data sets, we will be combining the chorizon to the component based on cokey. We will then combine that component_chorizon data to the mapunit data based on mukey.

However, in the chorizon file, there are 915 unique cokeys, while in the component data set, there are 942 unique cokeys. This means that there are some components without horizon level data.

There are an equal number of unique mapunit keys between the component and mapunit datasets: 271.

Clean data: remove columns that are empty

Many columns in the data sets are filled only with NAs. This is likely a result of the soil survey process itself. Removing these columns makes the datasets feel more manageable. We start with 171 columns in chorizon, 109 columns in component, and 25 columns in mapunit.

chorizon <- chorizon %>% 
  select_if(not_all_na)
component <- component %>% 
  select_if(not_all_na)
mapunit <- mapunit %>%
  select_if(not_all_na)

After this, we have 139 columns in chorizon, 78 columns in component, and 11 columns in mapunit.

Chorizon: Determine total soil depth

Create a new dataframe that consists of the deepest horizon bottom for each cokey
We will join this back to the chorizon dataset in a later step

#deepest horizon bottom of each component
depth <- chorizon %>%
  group_by(cokey) %>%
  summarise(total.depth = max(hzdepb.r))
head(depth)

## # A tibble: 6 x 2
##      cokey total.depth
##      <dbl>       <dbl>
## 1 18042863         203
## 2 18042864         203
## 3 18042865         203
## 4 18042866         204
## 5 18042867         152
## 6 18042868         204

Chorizon: Filter to horizon depth up to 30 cm

#remove horizons that start below 30 cm
chorizon <- chorizon %>%
  filter(hzdept.r < 30) %>% 
  droplevels()

Chorizon: dealing with multiple horizons in the top 30 cm

We want to have only one observation per cokey for the join from chorizon data to component data. So we will need to further process the chorizon data.

component_count <- chorizon %>%
  group_by(cokey) %>%
  summarize(count = n()) %>%
  filter(count > 1)

nrow(component_count)

## [1] 811

There are 811 components with more than one horizon layer in the first 30 cm.

Chorizon: summarizing data across multiple horizons for each cokey

To clean this, we will summarize characteristics of interest with a weighted mean of the horizons present.

The first step in this process is to create a variable ‘thick’ which will be used to relate the amount of soil in each horizon in each component.
For example, if the top horizon goes from 0-20 cm and the next horizon goes from 20-48 cm:
- ‘thick’ for the first level will equal 20
- ‘thick’ for the second level will be the remainder up to 30 cm = 10

Chorizon: Code to compute weighted means

chorizon <- chorizon %>%
  mutate(thick = ifelse(hzdepb.r > 30, 30 - hzdept.r, 
                        hzdepb.r - hzdept.r)) %>%  
  group_by(cokey) %>%
  summarise(sand = round(weighted.mean(sandtotal.r, thick, na.rm = TRUE),2),
            silt = round(weighted.mean(silttotal.r, thick, na.rm = TRUE),2),
            clay = round(weighted.mean(claytotal.r, thick, na.rm = TRUE),2),
            om = round(weighted.mean(om.r, thick, na.rm = TRUE),2),
            ksat = round(weighted.mean(ksat.r, thick, na.rm = TRUE),2),
            k = round(weighted.mean(kffact, thick, na.rm = TRUE),2),
            cec = round(weighted.mean(cec7.r, thick, na.rm = TRUE),2),
            ph = round(weighted.mean(ph1to1h2o.r, thick),2))

Chorizon: Join computed weighted means with deepest soil depth

chorizon <- left_join(chorizon, depth, by = "cokey")

Now we have one observation per component for each of the variables we care about.

dim(chorizon)

## [1] 915  10

Chorizon: cleaned dataset

head(chorizon)

## # A tibble: 6 x 10
##      cokey  sand  silt   clay     om  ksat      k    cec    ph total.depth
##      <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>       <dbl>
## 1 18042863  66.7  23.2  10.1    2.87 23.3    0.24   5.51  6.07         203
## 2 18042864  65.6  22.6  11.8    1.88 23.3    0.2   10.4   6.2          203
## 3 18042865  66.7  23.2  10.1    2.87 23.3    0.24   5.51  6.07         203
## 4 18042866 NaN   NaN   NaN    NaN     2.33 NaN    NaN    NA            204
## 5 18042867  45.7  43.2  11.1    1.88 19.5    0.44   6.01  6.09         152
## 6 18042868  83.9  12.8   3.33   2.27 91.7    0.17   3.02  5.97         204

Note: The row with all NAs is data for the rock outcrop mapunit. We can view this later when we attach chorizon and component.

Component data: column selection

We have many columns that are of no interest. Let’s keep only the ones we want. To do this, I refer to the metadata pdf with column descriptions.

length(colnames(component))

## [1] 78

colnames(component)

##  [1] "comppct.l"        "comppct.r"        "comppct.h"        "compname"        
##  [5] "compkind"         "majcompflag"      "otherph"          "localphase"      
##  [9] "slope.l"          "slope.r"          "slope.h"          "slopelenusle.l"  
## [13] "slopelenusle.r"   "slopelenusle.h"   "runoff"           "tfact"           
## [17] "wei"              "weg"              "erocl"            "earthcovkind1"   
## [21] "earthcovkind2"    "hydricon"         "hydricrating"     "drainagecl"      
## [25] "elev.l"           "elev.r"           "elev.h"           "aspectccwise"    
## [29] "aspectrep"        "aspectcwise"      "geomdesc"         "albedodry.l"     
## [33] "albedodry.r"      "albedodry.h"      "airtempa.l"       "airtempa.r"      
## [37] "airtempa.h"       "map.l"            "map.r"            "map.h"           
## [41] "ffd.l"            "ffd.r"            "ffd.h"            "nirrcapcl"       
## [45] "nirrcapscl"       "nirrcapunit"      "irrcapcl"         "irrcapscl"       
## [49] "cropprodindex"    "constreeshrubgrp" "soilslippot"      "frostact"        
## [53] "initsub.l"        "initsub.r"        "initsub.h"        "totalsub.l"      
## [57] "totalsub.r"       "totalsub.h"       "hydgrp"           "corcon"          
## [61] "corsteel"         "taxclname"        "taxorder"         "taxsuborder"     
## [65] "taxgrtgroup"      "taxsubgrp"        "taxpartsize"      "taxpartsizemod"  
## [69] "taxceactcl"       "taxreaction"      "taxtempcl"        "taxmoistscl"     
## [73] "taxtempregime"    "soiltaxedition"   "fltemik2use"      "fltriumph2use"   
## [77] "mukey"            "cokey"

Component data: column selection

component <- component %>%
  dplyr::select(c(comppct.r, compname, majcompflag, slope.r, 
                  slopelenusle.r, runoff, tfact, wei, weg, erocl, 
                  elev.r, albedodry.r, airtempa.r, map.r, ffd.r, 
                  cropprodindex, taxpartsize, mukey, cokey))
dim(component)

## [1] 942  19

Join chorizon and component data

Using dplyr::left_join, we can return all the rows from component
- Join by “cokey”
- And we will get the component level characteristics for each horizon
- There will be some components that do not have horizon level data, and those will fill with NAs

component_horizon <- left_join(component, chorizon, by = c("cokey"))
dim(component_horizon)

## [1] 942  28

Component_horizon joined and cleaned

head(component_horizon)

## # A tibble: 6 x 28
##   comppct.r compname majcompflag slope.r slopelenusle.r runoff tfact   wei   weg
##       <dbl> <chr>    <chr>         <dbl>          <dbl> <chr>  <dbl> <dbl> <dbl>
## 1       100 Seaton   Yes               4             61 Low        5    56     5
## 2       100 Seaton   Yes               9             46 Medium     5    56     5
## 3       100 Seaton   Yes              16             30 Medium     5    56     5
## 4         3 Plainfi… No                4             61 Very …     5   220     1
## 5        95 Chelsea  Yes              22             18 Medium     5   250     1
## 6         2 Forkhorn No                4             61 Very …     3    86     3
## # … with 19 more variables: erocl <chr>, elev.r <dbl>, albedodry.r <dbl>,
## #   airtempa.r <dbl>, map.r <dbl>, ffd.r <dbl>, cropprodindex <dbl>,
## #   taxpartsize <chr>, mukey <dbl>, cokey <dbl>, sand <dbl>, silt <dbl>,
## #   clay <dbl>, om <dbl>, ksat <dbl>, k <dbl>, cec <dbl>, ph <dbl>,
## #   total.depth <dbl>

Rock outcrop (or check component name for a given cokey)

test <- component_horizon %>%
  filter(cokey == "18042866") #cokey full of NAs, noted from earlier slide
test$compname

## [1] "Rock outcrop"

Mapunit data: column selection

colnames(mapunit)

##  [1] "musym"       "muname"      "mukind"      "muacres"     "farmlndcl"  
##  [6] "interpfocus" "invesintens" "iacornsr"    "lkey"        "mukey"      
## [11] "county"

mapunit <- mapunit %>%
  dplyr::select(c(musym, muname, muacres, mukey, county))
head(mapunit)

## # A tibble: 6 x 5
##   musym muname                                            muacres  mukey county 
##   <chr> <chr>                                               <dbl>  <dbl> <chr>  
## 1 112B2 Seaton silt loam, 2 to 6 percent slopes, vfsl su…     210 2.37e6 crawfo…
## 2 112C2 Seaton silt loam, 6 to 12 percent slopes, vfsl s…     488 2.37e6 crawfo…
## 3 112D2 Seaton silt loam, 12 to 20 percent slopes, vfsl …     630 2.37e6 crawfo…
## 4 502D  Chelsea fine sand, 15 to 30 percent slopes             62 2.37e6 crawfo…
## 5 2022  Pits, siliceous sand                                   14 2.38e6 crawfo…
## 6 435D2 Nuxmaruhanixete silt loam, 12 to 20 percent slop…     118 2.38e6 crawfo…

Join component_horizon with mapunit

Again, using dplyr::left_join, we can return all the rows from component_horizon
- This time, join using “mukey”
- We will get component level information for each mapunit
- Recall that we will have some mapunits without data because there was no horizon level data

full_soil <- left_join(component_horizon, mapunit, by = c("mukey"))

Last cleaning step is to change all commas to underscores so that we can save with comma separated values.

full_soil <- full_soil %>%
  mutate(muname = gsub(", ", "_", muname))

Final product

full_soil[1:3,1:5] #rows 1 to 3, columns 1 to 5

## # A tibble: 3 x 5
##   comppct.r compname majcompflag slope.r slopelenusle.r
##       <dbl> <chr>    <chr>         <dbl>          <dbl>
## 1       100 Seaton   Yes               4             61
## 2       100 Seaton   Yes               9             46
## 3       100 Seaton   Yes              16             30

full_soil[5:7,22:29] # rows 5 to 7, columns 22 to 29

## # A tibble: 3 x 8
##    clay    om  ksat     k   cec    ph total.depth musym
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl> <chr>
## 1  5.97  0.63  91.7  0.15  4.71  6.11         203 502D 
## 2 10.5   2.09  23.3  0.24  5.73  6.2          183 502D 
## 3 NA    NA     NA   NA    NA    NA             NA 2022

Now you can save the data to your working directory!!

write.table(full_soil, "Final Clean Soil/fullSoilCrawfordVernon.txt", 
            row.names = FALSE, quote = FALSE, sep = ",")

Super important In order to import the text file to excel, make sure to use the import task wizard and make the “musym” column into text format. Otherwise ’musym’s coded ending in E2 (ex: 116E2) will be imported as 1.16xE^2.

Done with cleaning data and ready to get to modelling!