NRCS has multiple methods of determining saturated hydraulic conductivity (ksat). Those methods are not always linked to the data. So despite having a large database of ksat data, we can’t always say how we got that data.
But saturated hydraulic conductivity is a key data type that is used in many important soil interpretations.
I wanted to see if we could see which predictors were most important in determining ksat to see if we might find similarities across a patchwork of methods. Can we make broader categories to capture similar ksat values? Would it be more accurate to employ those categories in interpretations instead of numerical data that is (sometimes) unreliable?
Categories could constrain the use of ksat data, keeping data that was determined with questionable or unknown methods from being treated as a highly accurate variable within an important interpretation.
Sites
For this project, we will explore hydric soils in six geographically contrasting wildland sites:
| Soil Survey Area | County | State |
|---|---|---|
| TN640 | GSMNP | TN |
| TN059 | GSMNP | TN |
| TN609 | Blount County | TN |
| TN608 | Sevier County | TN |
| TN606 | Cocke County | TN |
| TN113 | Madison County | TN |
| TN093 | Knox County | TN |
| TN602 | Knox County | TN |
| NC009 | Ashe County | NC |
| NC101 | Johnston County | NC |
| NC099 | Jackson County | NC |
## Warning: package 'usmap' was built under R version 4.5.2
## Warning: package 'maps' was built under R version 4.5.2
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
## The following object is masked from 'package:cluster':
##
## votes.repub
Now, we’ll load our data in. This Soil Survey Area data from Soil Data Access was accessed through R via get_chorizon_from_SDA.
After accessing, I used tidyverse to select only the relevant columns:
Here’s a look at the data. We can already see that Ksat is going to be an issue. Identical data like this doesn’t bode well. That doesn’t mean it’s not meaningful, but it’s accuracy should be taken with a grain of salt. Or sand.
Overarching Objective: Use relevant predictors to model hydraulic conductivity.
Hypothesis
We hypothesize that bulk density and organic matter will be good predictors and remain in the model, while at least one of the particle sizes will be tossed out of the model due to similar trends with the other particle sizes. (Ex. as sand increases, silt and clay may decrease proportionally.)
First we need to check for any overly correlated variables, because
we’ll need to remove those from the model.
Looks like we found a two 100% correlated variables! The fragvol_r and total_frag_pct are related. That makes sense. One was probably used to calculate the other. Dropping total_frag_pct!
Now let’s fit a linear model and then take a look at the residuals in plots.
#head(residuals(fit_lm))
## 1 2 3 4 5 6
## -6.0697020 0.7790932 11.1736196 17.0022675 13.2292929 13.1269786
## 1 2 3 4 5 6
## -6.0697020 0.7790932 11.1736196 17.0022675 13.2292929 13.1269786
That could look worse! Let’s keep going.
## dbthirdbar_r sandtotal_r claytotal_r silttotal_r om_r
## 1.096471 18.484421 11.675590 12.772094 1.475545
## total_frags_pct awc_r
## 1.274128 1.395168
## dbthirdbar_r sandtotal_r claytotal_r silttotal_r om_r
## FALSE TRUE TRUE TRUE FALSE
## total_frags_pct awc_r
## FALSE FALSE
## `geom_smooth()` using formula = 'y ~ x'
Wow! That looks truly terrible.
Okay, well, let’s keep going. But do we need all these variables? Time for a stepwise selection.
##
## Backwards Step-down - Original Model
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
## silttotal_r 0.37 1 0.5415 0.37 1 0.5415 -1.63 122
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept -1.9940 6.52357 -0.3057 7.599e-01
## dbthirdbar_r 15.2107 3.79035 4.0130 5.995e-05
## sandtotal_r 0.3261 0.02305 14.1495 0.000e+00
## claytotal_r -0.3918 0.03765 -10.4050 0.000e+00
## om_r 1.7996 0.24698 7.2866 3.180e-13
## total_frags_pct 0.1836 0.02341 7.8449 4.330e-15
## awc_r -60.4667 10.39706 -5.8157 6.036e-09
##
## Factors in Final Model
##
## [1] dbthirdbar_r sandtotal_r claytotal_r om_r
## [5] total_frags_pct awc_r
## index.orig training test optimism index.corrected Lower
## R-square 0.4287 0.4375 0.4234 0.0141 0.4146 0.3691
## MSE 213.4496 209.2820 215.4134 -6.1314 219.5811 188.4257
## g 14.3368 14.4405 14.2275 0.2131 14.1237 12.7376
## Intercept 0.0000 0.0000 0.2816 -0.2816 0.2816 -1.1850
## Slope 1.0000 1.0000 0.9849 0.0151 0.9849 0.8796
## Upper n
## R-square 0.4418 40
## MSE 248.8593 40
## g 15.2240 40
## Intercept 1.8758 40
## Slope 1.0740 40
##
## Factors Retained in Backwards Elimination
##
## dbthirdbar_r sandtotal_r claytotal_r silttotal_r om_r total_frags_pct awc_r
## * * * * *
## * * * * *
## * * * * * *
## * * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * * *
## * * * * * *
## * * * * * * *
## * * * * *
## * * * * *
## * * * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * * *
## * * * * * *
## * * * * * *
## * * * *
## * * * * * *
##
## Frequencies of Numbers of Factors Retained
##
## 4 5 6 7
## 1 27 10 2
The backwards elimination doesn’t actually eliminate anything.
HOWEVER.
If many of these values are determined via the Rosetta pedotransfer function, then why would I have sand, silt, and clay. Those are used to calculate Ksat in the pedotransfer function!
Bulk Density and AWC are also optional in the Rosetta pedotransfer function, but I’ll keep those in for now. It is interesting to think about what variables could be included if I had another data set … soil structure, aggregate stability, something else?
Now we can fit our final model with its paired down variables.
## Linear Regression Model
##
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct +
## awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 2730 LR chi2 725.76 R2 0.233
## sigma170.9318 d.f. 4 R2 adj 0.232
## d.f. 2725 Pr(> chi2) 0.0000 g 11.740
##
## Residuals
##
## Min 1Q Median 3Q Max
## -220.520 -8.567 -1.951 4.066 178.452
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 11.9820 7.0173 1.71 0.0878
## dbthirdbar_r 15.5465 4.4042 3.53 0.0004
## om_r 3.4389 0.2750 12.50 <0.0001
## total_frags_pct 0.2395 0.0266 8.99 <0.0001
## awc_r -165.2406 11.3078 -14.61 <0.0001
## index.orig training test optimism index.corrected Lower
## R-square 0.0597 0.1747 0.1568 0.0180 0.0418 -0.1409
## MSE 351.2890 308.3487 317.3101 -8.9614 360.2505 185.3761
## g 11.7397 8.3785 8.1587 0.2198 11.5199 7.2769
## Intercept 0.0000 0.0000 0.3507 -0.3507 0.3507 -13.0446
## Slope 1.0000 1.0000 0.9828 0.0172 0.9828 0.3320
## Upper n
## R-square 0.2252 10
## MSE 631.6679 10
## g 15.9210 10
## Intercept 13.3693 10
## Slope 1.6781 10
## [1] 1.527482
## [1] 18.74271
## [1] 0.05972877
## `geom_smooth()` using formula = 'y ~ x'
Honestly, that looks way worse than the first model. I guess removing the predictors that were used to calculate the data decreased the accuracy of the model. But including them seems like pseudo accuracy to me?
I don’t know… ksat is very hard to model.
Let’s look at Final Model Accuracy
## Linear Regression Model
##
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct +
## awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 2730 LR chi2 725.76 R2 0.233
## sigma170.9318 d.f. 4 R2 adj 0.232
## d.f. 2725 Pr(> chi2) 0.0000 g 11.740
##
## Residuals
##
## Min 1Q Median 3Q Max
## -220.520 -8.567 -1.951 4.066 178.452
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 11.9820 7.0173 1.71 0.0878
## dbthirdbar_r 15.5465 4.4042 3.53 0.0004
## om_r 3.4389 0.2750 12.50 <0.0001
## total_frags_pct 0.2395 0.0266 8.99 <0.0001
## awc_r -165.2406 11.3078 -14.61 <0.0001
Oh, okay. Well, honestly, I’ve seen worse R2s.
Anova
## Analysis of Variance Response: ksat_r
##
## Factor d.f. Partial SS MS F P
## dbthirdbar_r 1 364054 364054.04 12.46 4e-04
## om_r 1 4567539 4567539.33 156.33 <.0001
## total_frags_pct 1 2360672 2360672.24 80.80 <.0001
## awc_r 1 6239146 6239145.96 213.54 <.0001
## REGRESSION 4 24246411 6061602.85 207.46 <.0001
## ERROR 2725 79618171 29217.68
Partial R2
Available water holding capacity causes the most variance
Model Effects
Plot Effects
OM and AWC have the steepest slopes here.
Honestly, this whole thing has been kind of a let down. Maybe I just don’t have enough data?
What if I just did the whole southeast?
For this project, we will explore hydric soils in six geographically contrasting wildland sites:
| State |
|---|
| Tennessee |
| North Carolina |
| South Carolina |
| Alabama |
| Florida |
| Louisiana |
| Mississippi |
| Kentucky |
Overarching Objective: Determine if expanding the data set improves our ability to model ksat data in a linear regression model.
Hypotheses
A multi-state data set will improve the model by increasing the range of ksat values. Adding in states with different soil environments and associated properties will improve the model.
We know which overly correlated variable to avoid this time (fragvol_r)
We’re also gonna leave out sand, silt, and clay
Now let’s fit a linear model and then take a look at the residuals.
## 1 2 3 4 5 6
## 4.1690954 5.0361881 -6.3191619 0.1517006 2.7723989 44.9459291
## 1 2 3 4 5 6
## 4.1690954 5.0361881 -6.3191619 0.1517006 2.7723989 44.9459291
That could look worse! Let’s keep going and fit the model.
## dbthirdbar_r om_r total_frags_pct awc_r
## 1.096620 1.053438 1.081195 1.122974
## dbthirdbar_r om_r total_frags_pct awc_r
## FALSE FALSE FALSE FALSE
## `geom_smooth()` using formula = 'y ~ x'
This looks better to me than the model plot with the previous dataset.
Okay, but do we need all these variables? Time for a stepwise selection.
##
## Backwards Step-down - Original Model
##
## No Factors Deleted
##
## Factors in Final Model
##
## [1] dbthirdbar_r om_r total_frags_pct awc_r
## index.orig training test optimism index.corrected Lower
## R-square 0.3142 0.3151 0.3141 0.0010 0.3132 0.3080
## MSE 634.8642 633.4792 634.9086 -1.4294 636.2936 624.0865
## g 18.8416 18.8627 18.8401 0.0226 18.8191 18.6149
## Intercept 0.0000 0.0000 0.0340 -0.0340 0.0340 -0.3623
## Slope 1.0000 1.0000 0.9985 0.0015 0.9985 0.9859
## Upper n
## R-square 0.3194 40
## MSE 648.6289 40
## g 19.0910 40
## Intercept 0.3146 40
## Slope 1.0190 40
##
## Factors Retained in Backwards Elimination
##
## dbthirdbar_r om_r total_frags_pct awc_r
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * *
##
## Frequencies of Numbers of Factors Retained
##
## 4
## 40
Alright, let’s go ahead and keep all four variables. So there won’t be any changes but that’s fine, we’ll go ahead and fit our final model. It will be the same as above.
## Linear Regression Model
##
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct +
## awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 189952 LR chi2 65915.08 R2 0.293
## sigma256.2682 d.f. 4 R2 adj 0.293
## d.f. 189947 Pr(> chi2) 0.0000 g 19.634
##
## Residuals
##
## Min 1Q Median 3Q Max
## -289.343 -11.640 -0.818 11.733 418.433
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 69.6636 0.9834 70.84 <0.0001
## dbthirdbar_r 6.9130 0.5988 11.54 <0.0001
## om_r 2.0966 0.0331 63.42 <0.0001
## total_frags_pct -0.2425 0.0047 -51.94 <0.0001
## awc_r -448.3970 1.7088 -262.41 <0.0001
## index.orig training test optimism index.corrected Lower
## R-square 0.3025 0.3142 0.3141 0.0001 0.3025 0.2797
## MSE 645.6560 634.8562 635.0096 -0.1534 645.8094 583.7682
## g 19.6342 18.8420 18.8395 0.0025 19.6317 18.7439
## Intercept 0.0000 0.0000 0.0016 -0.0016 0.0016 -1.2722
## Slope 1.0000 1.0000 0.9999 0.0001 0.9999 0.9474
## Upper n
## R-square 0.3342 10
## MSE 740.6073 10
## g 20.9836 10
## Intercept 1.7712 10
## Slope 1.0670 10
## [1] 1.527482
## [1] 25.40976
## [1] 0.3025211
## `geom_smooth()` using formula = 'y ~ x'
Not a bad R2… Could be better could be worse.
Training and test R2s look okay to me.
Final Model Accuracy
## Linear Regression Model
##
## ols(formula = ksat_r ~ dbthirdbar_r + om_r + total_frags_pct +
## awc_r, data = dat, weights = dat$hzdepb_r, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 189952 LR chi2 65915.08 R2 0.293
## sigma256.2682 d.f. 4 R2 adj 0.293
## d.f. 189947 Pr(> chi2) 0.0000 g 19.634
##
## Residuals
##
## Min 1Q Median 3Q Max
## -289.343 -11.640 -0.818 11.733 418.433
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 69.6636 0.9834 70.84 <0.0001
## dbthirdbar_r 6.9130 0.5988 11.54 <0.0001
## om_r 2.0966 0.0331 63.42 <0.0001
## total_frags_pct -0.2425 0.0047 -51.94 <0.0001
## awc_r -448.3970 1.7088 -262.41 <0.0001
Anova
## Analysis of Variance Response: ksat_r
##
## Factor d.f. Partial SS MS F P
## dbthirdbar_r 1 8752982 8.752982e+06 133.28 <.0001
## om_r 1 264132083 2.641321e+08 4021.90 <.0001
## total_frags_pct 1 177203322 1.772033e+08 2698.25 <.0001
## awc_r 1 4522054404 4.522054e+09 68856.70 <.0001
## REGRESSION 4 5174778959 1.293695e+09 19698.91 <.0001
## ERROR 189947 12474466697 6.567341e+04
Partial R2
Awc causes the most variance. The other variables are basically the same in this plot, which is different than with the previous, smaller dataset.
Model Effects
Plot Effects
You know, I’m starting to think a linear regression model just isn’t going to work.
But what about a tree?
We’ll keep our same dataset and sites.
But we’ll need more packages.
reloading our data.
Let’s look at a correlation matrix.
Yep, fragments v fragments are 100 % correlated, haha.
Sand is negatively correlated with clay and silt, supporting our original hypothesis that we likely wouldn’t need all three since they are related. However, we remvoved them anyway and we’ll continue with them removed for the tree.
Let’s go ahead and prepare our data for the tree, including creating calibration and validation datasets.
## Calibration samples: 132968
## Validation samples: 56984
We’re not going to look at the full tree because I gave up after it took almost an hour to load. This is a big dataset! Almost 190,000 observations after all rows with NAs are removed.
So, insead, let’s just take a look at the pruned tree.
## Entropy — terminal nodes: 6, total nodes: 11
##
## --- Entropy: top 3 splits ---
## var n dev
## 1 awc_r 132968 93086
## 2 awc_r 100596 63923
## 4 om_r 65935 39003
## Pruned Entropy — terminal nodes: 6 (from 6)
I’ll be honest, I actually really like this tree. I think this is a better way to look at the ksat data. Although, it’s definitely notable how important AWC is within this tree, given that it is one of the two parameters that are optional in the Rosetta pedotransfer function that allowed into our model. Bulk density, on the other hand, was left out of the tree.
To summarize the tree:
If available water holding capacity is low then we assume ksat is 28 or higher (92). If coarse fragment percent is greater than 7 than ksat is 28 where as if fragments are lower than we assume ksat is 92.
Does this make sense??? Yeah, maybe. Coarser soils will have higher ksat and lower AWC. The coarse fragments thing doesn’t totally make sense to me… although if there are more coarse fragments taken up the soil mass, then there will be less soil particles like sand occupying that space. So maybe that explains it?
Soils with the highest available water holding capacity use organic matter to determine ksat. For lower om content, we get the lowest ksat (9), while higher OM results in a ksat of 28.
For AWC between 0.11 and 0.14, we use depth to determine ksat. Deeper soils have lower ksat while shallow soils above 46 cm have higher ksat.
This all makes some amount of sense so far…
##
## === Binary spodic/nonspodic CP table ===
##
## Classification tree:
## rpart(formula = ksat_2 ~ om_r + total_frags_pct + dbthirdbar_r +
## awc_r + hzdepb_r, data = cal, method = "class")
##
## Variables actually used in tree construction:
## [1] awc_r hzdepb_r om_r total_frags_pct
##
## Root node error: 93086/132968 = 0.70006
##
## n= 132968
##
## CP nsplit rel error xerror xstd
## 1 0.055637 0 1.00000 1.00000 0.0017950
## 2 0.023505 1 0.94436 0.94436 0.0018542
## 3 0.023011 3 0.89735 0.90128 0.0018903
## 4 0.012429 4 0.87434 0.87435 0.0019088
## 5 0.010000 5 0.86191 0.86197 0.0019163
## 1-SE CP — binary: 0.010000
Next…
I would like to redo this tree with new categories.
But to figure out which categories we want to have, we need to get a sense of what the data looks like. So let’s check it out via histogram.
Okay, where gonna go with the following categories:
| Ksat values | Class |
|---|---|
| <= 9 | low |
| > 9 & <= 20 | medium |
| > 20 | high |
##
## 0 0.005 0.009 0.01 0.015 0.022 0.03 0.05 0.055 0.075
## 12 1 42 2 15 1 5 118 57 1
## 0.1 0.1569 0.2 0.2052 0.21 0.215 0.217 0.22 0.25 0.2826
## 14 3 760 3 2663 57 20 25 15 3
## 0.3 0.3058 0.32 0.35 0.4 0.41 0.4196 0.42 0.5 0.55
## 11 9 1 16 16 35 3 24 109 219
## 0.5571 0.5693 0.6 0.7 0.705 0.71 0.716 0.72 0.74 0.75
## 9 3 23 139 8 434 2 1 1 14
## 0.8 0.9 0.91 0.915 0.917 0.9174 0.9176 0.92 1 1.1
## 11 4081 385 39 49 10 4 1864 3999 9
## 1.15 1.2 1.3 1.4 1.41 1.417 1.45 1.5 1.7 1.71
## 1 127 8 34 61 7 3 1 53 3
## 1.72 1.75 1.8 1.9 2 2.0284 2.05 2.1 2.2 2.21
## 4 3 8 3 1511 18 2 13 627 87
## 2.26 2.3 2.31 2.32 2.325 2.33 2.4 2.5 2.6 2.62
## 7 8 10 196 12 434 32 120 14 2
## 2.7 2.8 2.815 2.82 2.8222 2.8229 2.823 2.83 2.9 3
## 4677 350 4 4325 9 3 40 223 12 2498
## 3.16 3.2 3.3368 3.3736 3.4487 3.5 3.8211 3.8423 3.8709 4
## 2 8 9 9 1 10 18 1 1 271
## 4.1 4.2 4.23 4.42 4.5 5 5.0255 5.5 5.645 5.8
## 4 35 67 2 3 217 3 471 1 2
## 5.9094 6 6.11 6.2214 7 7.053 7.056 7.2 7.265 7.27
## 1 231 1 1 883 2 6 52 12 317
## 7.3911 7.4138 7.5 7.6 7.625 7.7 7.75 7.755 7.76 7.78
## 18 1 7 6 1 749 3 5 1030 4
## 7.8 8 8.2 9 9.0176 9.055 9.1 9.11 9.14 9.17
## 9 2524 13 57037 5 6 29 459 2 20511
## 9.172 9.1735 9.174 9.1743 9.176 9.18 9.19 9.2 9.23 9.41
## 27 65 4 2 111 325 16 103 3 5
## 9.7 10 11 12 13 14 14.1 14.11 15 16
## 3 540 39 949 89 684 8 47 22 4
## 17 18 18.2791 19.1 19.44 20 21 21.17 21.2 21.38
## 6 406 3 3 3 214 40 1 68 1
## 21.7 21.87 21.88 22 23 23.11 23.12 23.285 23.29 23.3
## 2 8 61 233 5878 97 4 4 1401 1
## 23.9 24 25 26 26.8 27 28 28.1 28.22 28.222
## 32 21 60 8 7 8 31672 77 17 23
## 28.225 28.227 28.23 28.235 28.7 28.865 29 30 31 32
## 63 4 7079 95 1 1 5 60 1 73
## 35 36 40 42 42.34 46 50 55 55.5 56
## 57 2 917 103 14 1 4 171 2 2
## 60 65 68 70 70.57 70.71 71 72 72.5 73
## 67 29 2 366 1 3 399 29 12 215
## 75 77 77.5 77.6 77.61 77.63 77.64 77.67 78 80
## 108 5 87 12 3 444 24 1 1954 7
## 84 90 91 91.5 91.7 91.72 91.725 91.735 91.74 91.76
## 4 52 259 91 13 21 6 8 2678 167
## 91.764 91.8 91.83 92 94 95 100 120 140 141
## 123 1 3 13070 2 95 15 117 6 377
## 141.14 141.15 142 150 197 197.5 197.67 198 200 210
## 16 30 7 1 15 3 16 6 15 148
## 211 211.5 211.71 212 246.97 246.995 247 297 348 373
## 1 18 11 274 15 22 425 5 2 4
## 423 432
## 26 1
##
## low medium high
## 94943 24693 70316
Now we need to remake our calibration and validation data sets
## Calibration samples: 132969
## Validation samples: 56983
Now let’s make our tree model.
And then we’ll get to pruning.
## Entropy — terminal nodes: 6, total nodes: 11
##
## --- Entropy: top 3 splits ---
## var n dev
## 1 awc_r 132969 66508
## 2 hzdepb_r 94497 38914
## 5 awc_r 38596 22190
## Pruned Entropy — terminal nodes: 6 (from 6)
I think this looks pretty good!
Medium was unused (bummer) so binary and multiclass are going to essentially be the same.
Some interesting changes from our last tree:
No coarse fragments and no bulk density!
Let’s take a look and make the confusion matrix.
## === Multi-class spodint CP table ===
##
## Classification tree:
## rpart(formula = ksatint3 ~ om_r + total_frags_pct + dbthirdbar_r +
## awc_r + hzdepb_r, data = cal, method = "class")
##
## Variables actually used in tree construction:
## [1] awc_r hzdepb_r om_r
##
## Root node error: 66508/132969 = 0.50018
##
## n= 132969
##
## CP nsplit rel error xerror xstd
## 1 0.225432 0 1.00000 1.00000 0.0027414
## 2 0.031515 1 0.77457 0.77457 0.0026710
## 3 0.013367 3 0.71154 0.70542 0.0026200
## 4 0.010000 5 0.68480 0.68383 0.0026010
##
## === Binary high/low CP table ===
##
## Classification tree:
## rpart(formula = ksatint3 ~ om_r + total_frags_pct + dbthirdbar_r +
## awc_r + hzdepb_r, data = cal, method = "class")
##
## Variables actually used in tree construction:
## [1] awc_r hzdepb_r om_r
##
## Root node error: 66508/132969 = 0.50018
##
## n= 132969
##
## CP nsplit rel error xerror xstd
## 1 0.225432 0 1.00000 1.00000 0.0027414
## 2 0.031515 1 0.77457 0.77457 0.0026710
## 3 0.013367 3 0.71154 0.70542 0.0026200
## 4 0.010000 5 0.68480 0.68383 0.0026010
## 1-SE CP — multi-class: 0.010000
## 1-SE CP — binary: 0.010000
## Model Accuracy Kappa
## 1 Multi-Class (3 levels) 0.66 0.389
## 2 Binary (high/low) 0.66 0.389
Honestly, an accuracy of 0.66 isn’t bad! Better than coin flip!
I definitely think the tree is a better way to model ksat data.
I also think that there are cool things that we can do with our ksat data in its current state, as long as we don’t ignore the limitations that come from the methods used to determine it. For example: thinking of ksat in categories instead of exact values may help us get more out of it without over interpreting it by assuming it is always accurate. It can’t be totally accurate… there are way too many identical values! Just look at the linear regression model.
I am going to continue this work. There’s more I want to do with ksat, and I really like how these trees turned out.
It would be cool if there was a way to bring land management into the tree… I have no idea how I would do that though.
Something to think about!