Saturated Hydraulic Conductivity

Background

NRCS has multiple methods of determining saturated hydraulic conductivity (ksat). Those methods are not always linked to the data. So despite having a large database of ksat data, we can’t always say how we got that data.

But saturated hydraulic conductivity is a key data type that is used in many important soil interpretations.

I wanted to see if we could see which predictors were most important in determining ksat to see if we might find similarities across a patchwork of methods. Can we make broader categories to capture similar ksat values? Would it be more accurate to employ those categories in interpretations instead of numerical data that is (sometimes) unreliable?

Categories could constrain the use of ksat data, keeping data that was determined with questionable or unknown methods from being treated as a highly accurate variable within an important interpretation.

STUDY AREA

Sites

For this project, we will explore hydric soils in six geographically contrasting wildland sites:

Soil Survey Area County State
TN640 GSMNP TN
TN059 GSMNP TN
TN609 Blount County TN
TN608 Sevier County TN
TN606 Cocke County TN
TN113 Madison County TN
TN093 Knox County TN
TN602 Knox County TN
NC009 Ashe County NC
NC101 Johnston County NC
NC099 Jackson County NC
## Warning: package 'usmap' was built under R version 4.5.2
## Warning: package 'maps' was built under R version 4.5.2
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
## The following object is masked from 'package:cluster':
## 
##     votes.repub

Now, we’ll load our data in. This Soil Survey Area data from Soil Data Access was accessed through R via get_chorizon_from_SDA.

After accessing, I used tidyverse to select only the relevant columns:

  • Saturated Hydraulic conductivity (ksat_r)
  • Available Water Holding Capacity (awc_r)
  • Volume of coarse fragments (fragvol_r)
  • Bulk Density (dbthirdbar_r)
  • Sand Content (sandtotal_r)
  • Clay Content (claytotal_r)
  • Silt Content (silttotal_r)
  • Organic Matter Content (om_r)
  • Total Fragments Percent (total_frags_pct)

Here’s a look at the data. We can already see that Ksat is going to be an issue. Identical data like this doesn’t bode well. That doesn’t mean it’s not meaningful, but it’s accuracy should be taken with a grain of salt. Or sand.

OBJECTIVES AND HYPOTHESES


Overarching Objective: Use relevant predictors to model hydraulic conductivity.


Hypothesis

We hypothesize that bulk density and organic matter will be good predictors and remain in the model, while at least one of the particle sizes will be tossed out of the model due to similar trends with the other particle sizes. (Ex. as sand increases, silt and clay may decrease proportionally.)

RESULTS AND DISCUSSION

First we need to check for any overly correlated variables, because we’ll need to remove those from the model.

Looks like we found a two 100% correlated variables! The fragvol_r and total_frag_pct are related. That makes sense. One was probably used to calculate the other. Dropping total_frag_pct!

Now let’s fit a linear model and then take a look at the residuals in plots.

#head(residuals(fit_lm))

##          1          2          3          4          5          6 
## -6.0697020  0.7790932 11.1736196 17.0022675 13.2292929 13.1269786
##          1          2          3          4          5          6 
## -6.0697020  0.7790932 11.1736196 17.0022675 13.2292929 13.1269786

That could look worse! Let’s keep going.

##    dbthirdbar_r     sandtotal_r     claytotal_r     silttotal_r            om_r 
##        1.096471       18.484421       11.675590       12.772094        1.475545 
## total_frags_pct           awc_r 
##        1.274128        1.395168
##    dbthirdbar_r     sandtotal_r     claytotal_r     silttotal_r            om_r 
##           FALSE            TRUE            TRUE            TRUE           FALSE 
## total_frags_pct           awc_r 
##           FALSE           FALSE
## `geom_smooth()` using formula = 'y ~ x'

Wow! That looks truly terrible.

Okay, well, let’s keep going. But do we need all these variables? Time for a stepwise selection.

## 
##      Backwards Step-down - Original Model
## 
##  Deleted     Chi-Sq d.f. P      Residual d.f. P      AIC   R2 
##  silttotal_r 0.37   1    0.5415 0.37     1    0.5415 -1.63 122
## 
## Approximate Estimates after Deleting Factors
## 
##                     Coef     S.E.   Wald Z         P
## Intercept        -1.9940  6.52357  -0.3057 7.599e-01
## dbthirdbar_r     15.2107  3.79035   4.0130 5.995e-05
## sandtotal_r       0.3261  0.02305  14.1495 0.000e+00
## claytotal_r      -0.3918  0.03765 -10.4050 0.000e+00
## om_r              1.7996  0.24698   7.2866 3.180e-13
## total_frags_pct   0.1836  0.02341   7.8449 4.330e-15
## awc_r           -60.4667 10.39706  -5.8157 6.036e-09
## 
## Factors in Final Model
## 
## [1] dbthirdbar_r    sandtotal_r     claytotal_r     om_r           
## [5] total_frags_pct awc_r
##           index.orig training     test optimism index.corrected    Lower
## R-square      0.4287   0.4375   0.4234   0.0141          0.4146   0.3691
## MSE         213.4496 209.2820 215.4134  -6.1314        219.5811 188.4257
## g            14.3368  14.4405  14.2275   0.2131         14.1237  12.7376
## Intercept     0.0000   0.0000   0.2816  -0.2816          0.2816  -1.1850
## Slope         1.0000   1.0000   0.9849   0.0151          0.9849   0.8796
##              Upper  n
## R-square    0.4418 40
## MSE       248.8593 40
## g          15.2240 40
## Intercept   1.8758 40
## Slope       1.0740 40
## 
## Factors Retained in Backwards Elimination
## 
##  dbthirdbar_r sandtotal_r claytotal_r silttotal_r om_r total_frags_pct awc_r
##               *           *                       *    *               *    
##               *           *                       *    *               *    
##  *            *           *           *                *               *    
##               *           *           *           *    *               *    
##               *           *                       *    *               *    
##               *           *                       *    *               *    
##               *           *                       *    *               *    
##               *                       *           *    *               *    
##  *            *                       *           *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##               *           *           *           *    *               *    
##               *           *           *           *    *               *    
##  *            *           *           *           *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##  *            *           *           *           *    *               *    
##               *           *                       *    *               *    
##               *           *                       *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##               *           *                       *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##               *           *                       *    *               *    
##               *           *                       *    *               *    
##  *            *           *                       *    *               *    
##               *                       *           *    *               *    
##               *           *                       *    *               *    
##                           *           *           *    *               *    
##               *                       *           *    *               *    
##               *                       *           *    *               *    
##  *            *                       *           *    *               *    
##               *           *           *           *    *               *    
##               *           *           *           *    *               *    
##               *           *                       *    *                    
##               *           *           *           *    *               *    
## 
## Frequencies of Numbers of Factors Retained
## 
##  4  5  6  7 
##  1 27 10  2

The backwards elimination doesn’t actually eliminate anything.

HOWEVER.

If many of these values are determined via the Rosetta pedotransfer function, then why would I have sand, silt, and clay. Those are used to calculate Ksat in the pedotransfer function!

Bulk Density and AWC are also optional in the Rosetta pedotransfer function, but I’ll keep those in for now. It is interesting to think about what variables could be included if I had another data set … soil structure, aggregate stability, something else?

Now we can fit our final model with its paired down variables.

## Linear Regression Model
## 
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct + 
##     awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
## 
##                   Model Likelihood    Discrimination    
##                         Ratio Test           Indexes    
## Obs      2730    LR chi2    725.76    R2       0.233    
## sigma170.9318    d.f.            4    R2 adj   0.232    
## d.f.     2725    Pr(> chi2) 0.0000    g       11.740    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -220.520   -8.567   -1.951    4.066  178.452 
## 
## 
##                 Coef      S.E.    t      Pr(>|t|)
## Intercept         11.9820  7.0173   1.71 0.0878  
## dbthirdbar_r      15.5465  4.4042   3.53 0.0004  
## om_r               3.4389  0.2750  12.50 <0.0001 
## total_frags_pct    0.2395  0.0266   8.99 <0.0001 
## awc_r           -165.2406 11.3078 -14.61 <0.0001
##           index.orig training     test optimism index.corrected    Lower
## R-square      0.0597   0.1747   0.1568   0.0180          0.0418  -0.1409
## MSE         351.2890 308.3487 317.3101  -8.9614        360.2505 185.3761
## g            11.7397   8.3785   8.1587   0.2198         11.5199   7.2769
## Intercept     0.0000   0.0000   0.3507  -0.3507          0.3507 -13.0446
## Slope         1.0000   1.0000   0.9828   0.0172          0.9828   0.3320
##              Upper  n
## R-square    0.2252 10
## MSE       631.6679 10
## g          15.9210 10
## Intercept  13.3693 10
## Slope       1.6781 10
## [1] 1.527482
## [1] 18.74271
## [1] 0.05972877
## `geom_smooth()` using formula = 'y ~ x'

Honestly, that looks way worse than the first model. I guess removing the predictors that were used to calculate the data decreased the accuracy of the model. But including them seems like pseudo accuracy to me?

I don’t know… ksat is very hard to model.

Let’s look at Final Model Accuracy

## Linear Regression Model
## 
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct + 
##     awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
## 
##                   Model Likelihood    Discrimination    
##                         Ratio Test           Indexes    
## Obs      2730    LR chi2    725.76    R2       0.233    
## sigma170.9318    d.f.            4    R2 adj   0.232    
## d.f.     2725    Pr(> chi2) 0.0000    g       11.740    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -220.520   -8.567   -1.951    4.066  178.452 
## 
## 
##                 Coef      S.E.    t      Pr(>|t|)
## Intercept         11.9820  7.0173   1.71 0.0878  
## dbthirdbar_r      15.5465  4.4042   3.53 0.0004  
## om_r               3.4389  0.2750  12.50 <0.0001 
## total_frags_pct    0.2395  0.0266   8.99 <0.0001 
## awc_r           -165.2406 11.3078 -14.61 <0.0001

Oh, okay. Well, honestly, I’ve seen worse R2s.

Anova

##                 Analysis of Variance          Response: ksat_r 
## 
##  Factor          d.f. Partial SS MS         F      P     
##  dbthirdbar_r       1   364054    364054.04  12.46 4e-04 
##  om_r               1  4567539   4567539.33 156.33 <.0001
##  total_frags_pct    1  2360672   2360672.24  80.80 <.0001
##  awc_r              1  6239146   6239145.96 213.54 <.0001
##  REGRESSION         4 24246411   6061602.85 207.46 <.0001
##  ERROR           2725 79618171     29217.68

Partial R2

Available water holding capacity causes the most variance

Model Effects

Plot Effects

OM and AWC have the steepest slopes here.

Honestly, this whole thing has been kind of a let down. Maybe I just don’t have enough data?

What if I just did the whole southeast?

STUDY AREA

For this project, we will explore hydric soils in six geographically contrasting wildland sites:

State
Tennessee
North Carolina
South Carolina
Alabama
Florida
Louisiana
Mississippi
Kentucky

OBJECTIVES AND HYPOTHESES


Overarching Objective: Determine if expanding the data set improves our ability to model ksat data in a linear regression model.


Hypotheses

A multi-state data set will improve the model by increasing the range of ksat values. Adding in states with different soil environments and associated properties will improve the model.

RESULTS AND DISCUSSION

We know which overly correlated variable to avoid this time (fragvol_r)

We’re also gonna leave out sand, silt, and clay

Now let’s fit a linear model and then take a look at the residuals.

##          1          2          3          4          5          6 
##  4.1690954  5.0361881 -6.3191619  0.1517006  2.7723989 44.9459291
##          1          2          3          4          5          6 
##  4.1690954  5.0361881 -6.3191619  0.1517006  2.7723989 44.9459291

That could look worse! Let’s keep going and fit the model.

##    dbthirdbar_r            om_r total_frags_pct           awc_r 
##        1.096620        1.053438        1.081195        1.122974
##    dbthirdbar_r            om_r total_frags_pct           awc_r 
##           FALSE           FALSE           FALSE           FALSE
## `geom_smooth()` using formula = 'y ~ x'

This looks better to me than the model plot with the previous dataset.

Okay, but do we need all these variables? Time for a stepwise selection.

## 
##      Backwards Step-down - Original Model
## 
## No Factors Deleted
## 
## Factors in Final Model
## 
## [1] dbthirdbar_r    om_r            total_frags_pct awc_r
##           index.orig training     test optimism index.corrected    Lower
## R-square      0.3142   0.3151   0.3141   0.0010          0.3132   0.3080
## MSE         634.8642 633.4792 634.9086  -1.4294        636.2936 624.0865
## g            18.8416  18.8627  18.8401   0.0226         18.8191  18.6149
## Intercept     0.0000   0.0000   0.0340  -0.0340          0.0340  -0.3623
## Slope         1.0000   1.0000   0.9985   0.0015          0.9985   0.9859
##              Upper  n
## R-square    0.3194 40
## MSE       648.6289 40
## g          19.0910 40
## Intercept   0.3146 40
## Slope       1.0190 40
## 
## Factors Retained in Backwards Elimination
## 
##  dbthirdbar_r om_r total_frags_pct awc_r
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
##  *            *    *               *    
## 
## Frequencies of Numbers of Factors Retained
## 
##  4 
## 40

Alright, let’s go ahead and keep all four variables. So there won’t be any changes but that’s fine, we’ll go ahead and fit our final model. It will be the same as above.

## Linear Regression Model
## 
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct + 
##     awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
## 
##                     Model Likelihood    Discrimination    
##                           Ratio Test           Indexes    
## Obs    189952    LR chi2    65915.08    R2       0.293    
## sigma256.2682    d.f.              4    R2 adj   0.293    
## d.f.   189947    Pr(> chi2)   0.0000    g       19.634    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -289.343  -11.640   -0.818   11.733  418.433 
## 
## 
##                 Coef      S.E.   t       Pr(>|t|)
## Intercept         69.6636 0.9834   70.84 <0.0001 
## dbthirdbar_r       6.9130 0.5988   11.54 <0.0001 
## om_r               2.0966 0.0331   63.42 <0.0001 
## total_frags_pct   -0.2425 0.0047  -51.94 <0.0001 
## awc_r           -448.3970 1.7088 -262.41 <0.0001
##           index.orig training     test optimism index.corrected    Lower
## R-square      0.3025   0.3142   0.3141   0.0001          0.3025   0.2797
## MSE         645.6560 634.8562 635.0096  -0.1534        645.8094 583.7682
## g            19.6342  18.8420  18.8395   0.0025         19.6317  18.7439
## Intercept     0.0000   0.0000   0.0016  -0.0016          0.0016  -1.2722
## Slope         1.0000   1.0000   0.9999   0.0001          0.9999   0.9474
##              Upper  n
## R-square    0.3342 10
## MSE       740.6073 10
## g          20.9836 10
## Intercept   1.7712 10
## Slope       1.0670 10
## [1] 1.527482
## [1] 25.40976
## [1] 0.3025211
## `geom_smooth()` using formula = 'y ~ x'

Not a bad R2… Could be better could be worse.

Training and test R2s look okay to me.

Final Model Accuracy

## Linear Regression Model
## 
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct + 
##     awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
## 
##                     Model Likelihood    Discrimination    
##                           Ratio Test           Indexes    
## Obs    189952    LR chi2    65915.08    R2       0.293    
## sigma256.2682    d.f.              4    R2 adj   0.293    
## d.f.   189947    Pr(> chi2)   0.0000    g       19.634    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -289.343  -11.640   -0.818   11.733  418.433 
## 
## 
##                 Coef      S.E.   t       Pr(>|t|)
## Intercept         69.6636 0.9834   70.84 <0.0001 
## dbthirdbar_r       6.9130 0.5988   11.54 <0.0001 
## om_r               2.0966 0.0331   63.42 <0.0001 
## total_frags_pct   -0.2425 0.0047  -51.94 <0.0001 
## awc_r           -448.3970 1.7088 -262.41 <0.0001

Anova

##                 Analysis of Variance          Response: ksat_r 
## 
##  Factor          d.f.   Partial SS  MS           F        P     
##  dbthirdbar_r         1     8752982 8.752982e+06   133.28 <.0001
##  om_r                 1   264132083 2.641321e+08  4021.90 <.0001
##  total_frags_pct      1   177203322 1.772033e+08  2698.25 <.0001
##  awc_r                1  4522054404 4.522054e+09 68856.70 <.0001
##  REGRESSION           4  5174778959 1.293695e+09 19698.91 <.0001
##  ERROR           189947 12474466697 6.567341e+04

Partial R2

Awc causes the most variance. The other variables are basically the same in this plot, which is different than with the previous, smaller dataset.

Model Effects

Plot Effects

You know, I’m starting to think a linear regression model just isn’t going to work.

But what about a tree?

We’ll keep our same dataset and sites.

But we’ll need more packages.

reloading our data.

Let’s look at a correlation matrix.

Yep, fragments v fragments are 100 % correlated, haha.

Sand is negatively correlated with clay and silt, supporting our original hypothesis that we likely wouldn’t need all three since they are related. However, we remvoved them anyway and we’ll continue with them removed for the tree.

Let’s go ahead and prepare our data for the tree, including creating calibration and validation datasets.

## Calibration samples: 132968
## Validation samples:  56984

We’re not going to look at the full tree because I gave up after it took almost an hour to load. This is a big dataset! Almost 190,000 observations after all rows with NAs are removed.

So, insead, let’s just take a look at the pruned tree.

## Entropy   — terminal nodes: 6,  total nodes: 11
## 
## --- Entropy: top 3 splits ---
##     var      n   dev
## 1 awc_r 132968 93086
## 2 awc_r 100596 63923
## 4  om_r  65935 39003
## Pruned Entropy — terminal nodes: 6  (from 6)

I’ll be honest, I actually really like this tree. I think this is a better way to look at the ksat data. Although, it’s definitely notable how important AWC is within this tree, given that it is one of the two parameters that are optional in the Rosetta pedotransfer function that allowed into our model. Bulk density, on the other hand, was left out of the tree.

To summarize the tree:

If available water holding capacity is low then we assume ksat is 28 or higher (92). If coarse fragment percent is greater than 7 than ksat is 28 where as if fragments are lower than we assume ksat is 92.

Does this make sense??? Yeah, maybe. Coarser soils will have higher ksat and lower AWC. The coarse fragments thing doesn’t totally make sense to me… although if there are more coarse fragments taken up the soil mass, then there will be less soil particles like sand occupying that space. So maybe that explains it?

Soils with the highest available water holding capacity use organic matter to determine ksat. For lower om content, we get the lowest ksat (9), while higher OM results in a ksat of 28.

For AWC between 0.11 and 0.14, we use depth to determine ksat. Deeper soils have lower ksat while shallow soils above 46 cm have higher ksat.

This all makes some amount of sense so far…

## 
## === Binary spodic/nonspodic CP table ===
## 
## Classification tree:
## rpart(formula = ksat_2 ~ om_r + total_frags_pct + dbthirdbar_r + 
##     awc_r + hzdepb_r, data = cal, method = "class")
## 
## Variables actually used in tree construction:
## [1] awc_r           hzdepb_r        om_r            total_frags_pct
## 
## Root node error: 93086/132968 = 0.70006
## 
## n= 132968 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.055637      0   1.00000 1.00000 0.0017950
## 2 0.023505      1   0.94436 0.94436 0.0018542
## 3 0.023011      3   0.89735 0.90128 0.0018903
## 4 0.012429      4   0.87434 0.87435 0.0019088
## 5 0.010000      5   0.86191 0.86197 0.0019163

## 1-SE CP — binary:      0.010000

Next…

I would like to redo this tree with new categories.

But to figure out which categories we want to have, we need to get a sense of what the data looks like. So let’s check it out via histogram.

Okay, where gonna go with the following categories:

Ksat values Class
<= 9 low
> 9 & <= 20 medium
> 20 high
## 
##       0   0.005   0.009    0.01   0.015   0.022    0.03    0.05   0.055   0.075 
##      12       1      42       2      15       1       5     118      57       1 
##     0.1  0.1569     0.2  0.2052    0.21   0.215   0.217    0.22    0.25  0.2826 
##      14       3     760       3    2663      57      20      25      15       3 
##     0.3  0.3058    0.32    0.35     0.4    0.41  0.4196    0.42     0.5    0.55 
##      11       9       1      16      16      35       3      24     109     219 
##  0.5571  0.5693     0.6     0.7   0.705    0.71   0.716    0.72    0.74    0.75 
##       9       3      23     139       8     434       2       1       1      14 
##     0.8     0.9    0.91   0.915   0.917  0.9174  0.9176    0.92       1     1.1 
##      11    4081     385      39      49      10       4    1864    3999       9 
##    1.15     1.2     1.3     1.4    1.41   1.417    1.45     1.5     1.7    1.71 
##       1     127       8      34      61       7       3       1      53       3 
##    1.72    1.75     1.8     1.9       2  2.0284    2.05     2.1     2.2    2.21 
##       4       3       8       3    1511      18       2      13     627      87 
##    2.26     2.3    2.31    2.32   2.325    2.33     2.4     2.5     2.6    2.62 
##       7       8      10     196      12     434      32     120      14       2 
##     2.7     2.8   2.815    2.82  2.8222  2.8229   2.823    2.83     2.9       3 
##    4677     350       4    4325       9       3      40     223      12    2498 
##    3.16     3.2  3.3368  3.3736  3.4487     3.5  3.8211  3.8423  3.8709       4 
##       2       8       9       9       1      10      18       1       1     271 
##     4.1     4.2    4.23    4.42     4.5       5  5.0255     5.5   5.645     5.8 
##       4      35      67       2       3     217       3     471       1       2 
##  5.9094       6    6.11  6.2214       7   7.053   7.056     7.2   7.265    7.27 
##       1     231       1       1     883       2       6      52      12     317 
##  7.3911  7.4138     7.5     7.6   7.625     7.7    7.75   7.755    7.76    7.78 
##      18       1       7       6       1     749       3       5    1030       4 
##     7.8       8     8.2       9  9.0176   9.055     9.1    9.11    9.14    9.17 
##       9    2524      13   57037       5       6      29     459       2   20511 
##   9.172  9.1735   9.174  9.1743   9.176    9.18    9.19     9.2    9.23    9.41 
##      27      65       4       2     111     325      16     103       3       5 
##     9.7      10      11      12      13      14    14.1   14.11      15      16 
##       3     540      39     949      89     684       8      47      22       4 
##      17      18 18.2791    19.1   19.44      20      21   21.17    21.2   21.38 
##       6     406       3       3       3     214      40       1      68       1 
##    21.7   21.87   21.88      22      23   23.11   23.12  23.285   23.29    23.3 
##       2       8      61     233    5878      97       4       4    1401       1 
##    23.9      24      25      26    26.8      27      28    28.1   28.22  28.222 
##      32      21      60       8       7       8   31672      77      17      23 
##  28.225  28.227   28.23  28.235    28.7  28.865      29      30      31      32 
##      63       4    7079      95       1       1       5      60       1      73 
##      35      36      40      42   42.34      46      50      55    55.5      56 
##      57       2     917     103      14       1       4     171       2       2 
##      60      65      68      70   70.57   70.71      71      72    72.5      73 
##      67      29       2     366       1       3     399      29      12     215 
##      75      77    77.5    77.6   77.61   77.63   77.64   77.67      78      80 
##     108       5      87      12       3     444      24       1    1954       7 
##      84      90      91    91.5    91.7   91.72  91.725  91.735   91.74   91.76 
##       4      52     259      91      13      21       6       8    2678     167 
##  91.764    91.8   91.83      92      94      95     100     120     140     141 
##     123       1       3   13070       2      95      15     117       6     377 
##  141.14  141.15     142     150     197   197.5  197.67     198     200     210 
##      16      30       7       1      15       3      16       6      15     148 
##     211   211.5  211.71     212  246.97 246.995     247     297     348     373 
##       1      18      11     274      15      22     425       5       2       4 
##     423     432 
##      26       1
## 
##    low medium   high 
##  94943  24693  70316

Now we need to remake our calibration and validation data sets

## Calibration samples: 132969
## Validation samples:  56983

Now let’s make our tree model.

And then we’ll get to pruning.

## Entropy   — terminal nodes: 6,  total nodes: 11
## 
## --- Entropy: top 3 splits ---
##        var      n   dev
## 1    awc_r 132969 66508
## 2 hzdepb_r  94497 38914
## 5    awc_r  38596 22190
## Pruned Entropy — terminal nodes: 6  (from 6)

I think this looks pretty good!

Medium was unused (bummer) so binary and multiclass are going to essentially be the same.

Some interesting changes from our last tree:

  • If available water content is low then assume ksat is high
  • If available water content is low and soil horizon is deeper than 50 cm then assume ksat is low
  • Shallow soil horizons break into high or low ksat based on levels of AWC and organic matter content.

No coarse fragments and no bulk density!

Let’s take a look and make the confusion matrix.

## === Multi-class spodint CP table ===
## 
## Classification tree:
## rpart(formula = ksatint3 ~ om_r + total_frags_pct + dbthirdbar_r + 
##     awc_r + hzdepb_r, data = cal, method = "class")
## 
## Variables actually used in tree construction:
## [1] awc_r    hzdepb_r om_r    
## 
## Root node error: 66508/132969 = 0.50018
## 
## n= 132969 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.225432      0   1.00000 1.00000 0.0027414
## 2 0.031515      1   0.77457 0.77457 0.0026710
## 3 0.013367      3   0.71154 0.70542 0.0026200
## 4 0.010000      5   0.68480 0.68383 0.0026010
## 
## === Binary high/low CP table ===
## 
## Classification tree:
## rpart(formula = ksatint3 ~ om_r + total_frags_pct + dbthirdbar_r + 
##     awc_r + hzdepb_r, data = cal, method = "class")
## 
## Variables actually used in tree construction:
## [1] awc_r    hzdepb_r om_r    
## 
## Root node error: 66508/132969 = 0.50018
## 
## n= 132969 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.225432      0   1.00000 1.00000 0.0027414
## 2 0.031515      1   0.77457 0.77457 0.0026710
## 3 0.013367      3   0.71154 0.70542 0.0026200
## 4 0.010000      5   0.68480 0.68383 0.0026010

## 1-SE CP — multi-class: 0.010000
## 1-SE CP — binary:      0.010000

##                    Model Accuracy Kappa
## 1 Multi-Class (3 levels)     0.66 0.389
## 2      Binary (high/low)     0.66 0.389

Honestly, an accuracy of 0.66 isn’t bad! Better than coin flip!

DISCUSSION

I definitely think the tree is a better way to model ksat data.

I also think that there are cool things that we can do with our ksat data in its current state, as long as we don’t ignore the limitations that come from the methods used to determine it. For example: thinking of ksat in categories instead of exact values may help us get more out of it without over interpreting it by assuming it is always accurate. It can’t be totally accurate… there are way too many identical values! Just look at the linear regression model.

I am going to continue this work. There’s more I want to do with ksat, and I really like how these trees turned out.

It would be cool if there was a way to bring land management into the tree… I have no idea how I would do that though.

Something to think about!