library(foreign)
options( java.parameters = "-Xmx12g" )
options(mc.cores=4)
library(RWeka)
library(tidyr)
library(dplyr)
library(purrr)
#library(pracma) # Savitzky-Golay filter, savgol() - unused
library(knitr) #kable
library(caret) # cross-validate R's lm()
# filename: advweka_soilsamples.Rmd

Introduction

Can Total Organic Carbon of soil samples be predicted, quickly and easily, from Near-IR spectroscopy data?

The following problem set (the dataset selection and the question design) was created by a researcher from New Zealand. It was an exercise to be completed by students of the Massive Open Online Course "Advanced Data Mining with Weka".

Application: Infrared data from soil samples

by Geoff Holmes

Department of Computer Science, University of Waikato, New Zealand

My (K.B.) objective was to reproduce this Weka Explorer exercise in R, using the RWeka and caret packages.

Activity / Exercise Text

Lesson 1.6 Activity: Analyzing a soil sample

We will examine a soil dataset that is described here.

It originates in Kenya and is supplied courtesy of the World Agroforestry Centre (ICRAF) and ISRIC, the International Soil Reference and Information Centre.

1. Preprocessing

The dataset has been converted into an ARFF file called org_c_n. Load it into the Weka Explorer. The instances represent 4439 samples of soil that have been processed by a NIR (near-infrared) device. Most of the 220 attributes are wave bands, and contain the reflectance values produced by the device. For our purposes the dataset should contain only the wave bands plus the class we are interested in, and for this activity we will concentrate on organic carbon. Remove the unnecessary attributes from the dataset.

  1. How many attributes remain?
samp <- read.arff(file="org_c_n.arff")
samp <- samp %>% 
        select( -Batch_Labid, -ISO)

Answer: 218

A glimpse of the first five and the last six columns of the dataset:

glimpse(samp[,c(1:5, ((ncol(samp) - 5):(ncol(samp))))])
## Observations: 4,439
## Variables: 11
## $ W350            <dbl> 0.08727, 0.09176, 0.08909, 0.09495, 0.09124, 0...
## $ W360            <dbl> 0.07229, 0.07082, 0.06935, 0.08900, 0.06571, 0...
## $ W370            <dbl> 0.06788, 0.06902, 0.06966, 0.08105, 0.06595, 0...
## $ W380            <dbl> 0.07128, 0.07013, 0.06820, 0.08351, 0.06622, 0...
## $ W390            <dbl> 0.07091, 0.07222, 0.07005, 0.08543, 0.06628, 0...
## $ W2470           <dbl> 0.3473, 0.3241, 0.2760, 0.2812, 0.3233, 0.3279...
## $ W2480           <dbl> 0.3393, 0.3262, 0.2687, 0.2807, 0.3206, 0.3255...
## $ W2490           <dbl> 0.3368, 0.3264, 0.2799, 0.2879, 0.3260, 0.3287...
## $ W2500           <dbl> 0.3428, 0.3309, 0.2756, 0.2903, 0.3245, 0.3240...
## $ OrganicNitrogen <dbl> 0.09, 0.06, 0.06, 0.05, NA, NA, NA, NA, 0.10, ...
## $ OrganicCarbon   <dbl> 0.99, 0.65, 0.46, 0.47, 0.19, 0.15, 0.13, NA, ...

There is still a problem with this dataset. If you click on the class attribute, OrganicCarbon, you will see that 12% of the values are missing.

samp %>% select(OrganicNitrogen, OrganicCarbon) %>% summary()
##  OrganicNitrogen OrganicCarbon 
##  Min.   :0.0     Min.   : 0.0  
##  1st Qu.:0.0     1st Qu.: 0.2  
##  Median :0.1     Median : 0.5  
##  Mean   :0.1     Mean   : 1.3  
##  3rd Qu.:0.1     3rd Qu.: 1.2  
##  Max.   :7.0     Max.   :62.8  
##  NA's   :1555    NA's   :528

These are samples for which there was no wet chemistry reference, and are useless for our purpose. Use an appropriate Weka instance filter to remove all instances whose class attribute is missing.

  1. How many instances remain?
samp <- samp %>%
        filter(!is.na(OrganicCarbon) )

Answer: 3911

samp %>% select(OrganicNitrogen, OrganicCarbon) %>% summary()
##  OrganicNitrogen OrganicCarbon  
##  Min.   :0.0     Min.   : 0.00  
##  1st Qu.:0.0     1st Qu.: 0.22  
##  Median :0.1     Median : 0.49  
##  Mean   :0.1     Mean   : 1.27  
##  3rd Qu.:0.1     3rd Qu.: 1.20  
##  Max.   :7.0     Max.   :62.78  
##  NA's   :1041
  1. We now set about benchmarking. The class is numeric, making this a regression problem. A simple classifier for regression problems is LinearRegression. Choose this in the Classify panel, along with 10-fold cross-validation (the default).
  1. What correlation coefficient does LinearRegression achieve (to four decimal places)?
samp <- samp  %>%
        select(-OrganicNitrogen)

# too simple, not cross-validated:
fit0 <- samp  %>% lm(data=., OrganicCarbon ~ .) %>% summary %>% .$r.squared


# 10 fold cross-validation with caret
train_control <- trainControl(method="cv", number=10)
fit1 <- train(OrganicCarbon~., data=samp, trControl=train_control, method="lm")


fit2 <- samp %>%
        RWeka::LinearRegression(data=., OrganicCarbon ~ .) %>%
        evaluate_Weka_classifier(.,
                              numFolds = 10, complexity = FALSE,
                              seed = 1, class = TRUE)

fit2.summ <- summary(fit2)
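
For reference, the figures quoted below can be read straight off the fitted objects; the two cross-validated values depend on the random fold assignment, so a set.seed() call before train() would be needed to reproduce them exactly.

# Non-cross-validated R-squared from lm(), stored above as fit0:
fit0
# Cross-validated metrics from caret, averaged over the 10 folds:
fit1$results[, c("RMSE", "Rsquared")]
# Weka's 10-fold evaluation, printed in full:
fit2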

Answer:

  • R’s lm() regression, not cross-validated: R² = 0.5222
  • R’s lm() regression, 10-fold cross-validated with the caret package: R² = 0.4319
  • Weka LinearRegression, 10-fold cross-validated: correlation coefficient = 0.3951

Note that Weka reports the correlation coefficient r, whereas lm() and caret report R², so the three figures are not directly comparable.

Next we investigate the performance of some more sophisticated classifiers: M5P, REPTree and RandomForest. (There are other possibilities, but they are all slower.) Run these three with default settings, and record the resulting correlation coefficients.

  1. What is the best correlation coefficient achieved by these classifiers?
weka_summary <- function(classifier, dfr){
        
        learner <- make_Weka_classifier(name=classifier)
        fit <-  dfr %>% 
                learner(data=., OrganicCarbon ~ .)
        # use 10-fold cross validation
        e <- evaluate_Weka_classifier(fit,
                              numFolds = 10, complexity = FALSE,
                              seed = 1, class = TRUE)
        e
}

lst <- list(classifier=list( 
        "weka.classifiers.functions.LinearRegression",
        "weka.classifiers.trees.M5P",
        "weka.classifiers.trees.REPTree",
        "weka.classifiers.trees.RandomForest"))

summaries <- pmap(.l=lst, .f=weka_summary, dfr=samp)
names(summaries) <-lst[["classifier"]]
summaries
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.3951
## Mean absolute error                      1.1959
## Root mean squared error                  2.9421
## Relative absolute error                 93.3773 %
## Root relative squared error             91.8547 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6026
## Mean absolute error                      0.9266
## Root mean squared error                  2.5573
## Relative absolute error                 72.3515 %
## Root relative squared error             79.8397 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6526
## Mean absolute error                      0.9654
## Root mean squared error                  2.4264
## Relative absolute error                 75.3802 %
## Root relative squared error             75.7531 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6934
## Mean absolute error                      0.8438
## Root mean squared error                  2.3088
## Relative absolute error                 65.8845 %
## Root relative squared error             72.0818 %
## Total Number of Instances             3911
# details[1] is the cross-validated correlation coefficient
r2_max_idx <- which.max(
        map_dbl(summaries, function(x) x$details[1])
        )
ans <- list(learner    = lst[["classifier"]][[r2_max_idx]],
            corr.coeff = summaries[[r2_max_idx]]$details[[1]])

Answer: weka.classifiers.trees.RandomForest, 0.6934

Downsampling

We now examine the effect of preprocessing the data, using the results of these classifiers as a benchmark. We investigate three techniques commonly used for NIR data: (1) downsampling, (2) row normalization, and (3) a signal-smoothing method called Savitzky-Golay.

Downsampling is a simple method that can accelerate processing with little loss in accuracy (this may also allow slower classification methods to be applied without too much delay).

By hand, remove every second attribute, W350, W370, … W2490. The resulting dataset will have 109 attributes including the class (you may wish to save it).

Run the benchmark classifiers (again with default settings), along with 10-fold cross-validation. You will probably notice that they are faster than before. We will continue to use the correlation coefficient as our measure of success.

  1. Which classifier performs worst on the downsampled data?
bad_cols <- seq(350,2490,20) %>% paste0("W", .)
bad_cols <- which(colnames(samp) %in% bad_cols)
samp_downsampled <- samp %>% 
        select(-bad_cols) 

summaries_downs <- pmap(.l=lst, .f=weka_summary, dfr=samp_downsampled)
names(summaries_downs) <-lst[["classifier"]]
summaries_downs
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.4006
## Mean absolute error                      1.1931
## Root mean squared error                  2.9344
## Relative absolute error                 93.1619 %
## Root relative squared error             91.6136 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6301
## Mean absolute error                      0.8939
## Root mean squared error                  2.4865
## Relative absolute error                 69.8017 %
## Root relative squared error             77.6304 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6541
## Mean absolute error                      0.9551
## Root mean squared error                  2.4229
## Relative absolute error                 74.5737 %
## Root relative squared error             75.6454 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6941
## Mean absolute error                      0.8442
## Root mean squared error                  2.3056
## Relative absolute error                 65.9213 %
## Root relative squared error             71.9807 %
## Total Number of Instances             3911
r2_min_idx <- which.min(
        map_dbl(summaries_downs, function(x) x$details[1])
        )

(ans_downs <- list(learner    = lst[["classifier"]][[r2_min_idx]],
                   corr.coeff = summaries_downs[[r2_min_idx]]$details[[1]]))
## $learner
## [1] "weka.classifiers.functions.LinearRegression"
## 
## $corr.coeff
## [1] 0.4006

Downsampling has improved both speed and accuracy for all these classifiers. Let’s keep going: make the dataset half the size again! Construct a new dataset with 55 attributes: the class and wavebands W380, W420, W460, … W2500. Run the benchmark again.

# Note: the activity asks to keep W380, W420, ..., W2500; removing those bands
# here keeps W360, W400, ..., W2480 instead. The attribute count (55) is the same.
bad_cols_2 <- seq(380, 2500, 40) %>% paste0("W", .)
bad_cols_2 <- which(colnames(samp_downsampled) %in% bad_cols_2)
samp_downsampled_2 <- samp_downsampled %>% 
        select(-bad_cols_2) 


summaries_downs_2 <- pmap(.l=lst, .f=weka_summary, dfr=samp_downsampled_2)
names(summaries_downs_2) <-lst[["classifier"]]
summaries_downs_2
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.3855
## Mean absolute error                      1.1889
## Root mean squared error                  2.9552
## Relative absolute error                 92.8312 %
## Root relative squared error             92.2639 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.5776
## Mean absolute error                      0.9249
## Root mean squared error                  2.6159
## Relative absolute error                 72.2181 %
## Root relative squared error             81.6698 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6151
## Mean absolute error                      0.9775
## Root mean squared error                  2.525 
## Relative absolute error                 76.3258 %
## Root relative squared error             78.8329 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6928
## Mean absolute error                      0.8491
## Root mean squared error                  2.311 
## Relative absolute error                 66.2989 %
## Root relative squared error             72.1519 %
## Total Number of Instances             3911
r2_downs <- map2_df(summaries_downs_2, summaries_downs,
        function(x, y) data.frame(r2_downsampled_2 = x$details[1],
                                  r2_downsampled_1 = y$details[1],
                                  diff             = x$details[1] - y$details[1]))

r2_max_idx <- which.max(r2_downs$diff)

ans_downs_2 <- data.frame(learner    = lst[["classifier"]][[r2_max_idx]],
                          corr.coeff = summaries_downs_2[[r2_max_idx]]$details[[1]])
  1. Compared to the first downsampling benchmark, which classifier gives the most improvement?
kable(cbind(names = names(summaries_downs_2),
            map2_df(summaries_downs, summaries_downs_2,
                    function(x, y) data.frame(r2_before = x$details[[1]],
                                              r2_after  = y$details[[1]],
                                              r2_diff   = y$details[[1]] - x$details[[1]]))),
      caption = "Comparison of the two downsampling steps (correlation coefficient)")
Comparison of the two downsampling steps (correlation coefficient)

|names                                        | r2_before| r2_after| r2_diff|
|:--------------------------------------------|---------:|--------:|-------:|
|weka.classifiers.functions.LinearRegression  |    0.4006|   0.3855| -0.0150|
|weka.classifiers.trees.M5P                   |    0.6301|   0.5776| -0.0526|
|weka.classifiers.trees.REPTree               |    0.6541|   0.6151| -0.0390|
|weka.classifiers.trees.RandomForest          |    0.6941|   0.6928| -0.0014|

Answer: weka.classifiers.trees.RandomForest. None of the classifiers actually improves after the second downsampling, but RandomForest degrades least (-0.0014).
  1. Next we look at row normalization, a “scatter correction” technique that is designed to address the problem of baseline effects to which all NIR devices are susceptible. Unfortunately, Weka does not (yet!) have a filter for row normalization, so we provide a new dataset, org_c_no_missing-rn. Load it into Weka and run the benchmark again. Note that this method does not remove data, so we are back to 217 attributes (including the class).

Row normalization improves results for two of the methods, compared to the original (non-downsampled) result. Which method gains most?
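
Since the row-normalization step itself is not shown anywhere in this activity, here is a minimal R sketch of one common variant, the standard normal variate (each spectrum is centred and scaled by its own mean and standard deviation). This is an assumption; the exact transform used to produce org_c_no_missing-rn is not documented, so below we simply load the supplied file.

# Sketch of row normalization (standard normal variate); the supplied -rn file
# may have been produced with a different variant.
row_normalize <- function(dfr, class_col = "OrganicCarbon") {
        bands <- as.matrix(dfr[, setdiff(names(dfr), class_col)])
        snv   <- t(apply(bands, 1, function(x) (x - mean(x)) / sd(x)))
        data.frame(snv, dfr[class_col], check.names = FALSE)
}
# samp_norm_snv <- row_normalize(samp)   # alternative to reading the supplied ARFF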

samp_norm <- read.arff(file="org_c_no_missing-rn.arff")
        

summaries_norm <- pmap(.l=lst, .f=weka_summary, dfr=samp_norm)
names(summaries_norm) <-lst[["classifier"]]
summaries_norm
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.513 
## Mean absolute error                      1.148 
## Root mean squared error                  2.7498
## Relative absolute error                 89.6384 %
## Root relative squared error             85.8504 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.703 
## Mean absolute error                      0.8034
## Root mean squared error                  2.2891
## Relative absolute error                 62.7346 %
## Root relative squared error             71.4686 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.4408
## Mean absolute error                      1.0717
## Root mean squared error                  2.9172
## Relative absolute error                 83.6822 %
## Root relative squared error             91.0777 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.633 
## Mean absolute error                      0.8185
## Root mean squared error                  2.4872
## Relative absolute error                 63.9111 %
## Root relative squared error             77.6511 %
## Total Number of Instances             3911
r2_max_idx <- which.max(
        map_dbl(summaries_norm, function(x) x$details[1]))

ans_norm_diffs <- cbind(algorithm = names(summaries_norm),
        map2_df(summaries_norm, summaries,
                function(x, y) data.frame(r2_orig       = y$details[[1]],
                                          r2_normalized = x$details[[1]],
                                          difference    = x$details[[1]] - y$details[[1]])))

kable(ans_norm_diffs,
      caption = "Comparison of unnormalized vs. row-normalized data (correlation coefficient)")
Comparison of unnormalized vs. row-normalized data (correlation coefficient)

|algorithm                                    | r2_orig| r2_normalized| difference|
|:--------------------------------------------|-------:|-------------:|----------:|
|weka.classifiers.functions.LinearRegression  |  0.3951|        0.5130|     0.1179|
|weka.classifiers.trees.M5P                   |  0.6026|        0.7030|     0.1004|
|weka.classifiers.trees.REPTree               |  0.6526|        0.4408|    -0.2118|
|weka.classifiers.trees.RandomForest          |  0.6934|        0.6330|    -0.0604|
r2_max_idx <- which.max(ans_norm_diffs$difference)

(ans_norm <- list(learner=lst[["classifier"]][[r2_max_idx]],
                     corr.coeff= summaries_norm[[r2_max_idx]]$details[[1]]))
## $learner
## [1] "weka.classifiers.functions.LinearRegression"
## 
## $corr.coeff
## [1] 0.513

Answer: weka.classifiers.functions.LinearRegression, 0.513

  1. The spectral derivative is a third preprocessing tool: it smooths the spectral signal. One of the most prominent methods is called Savitzky-Golay, which corrects (smooths) each point using a fixed-width window centered on the point; the window’s width is a parameter. Again, this method is not in Weka, so we have produced datasets that smooth the signal using windows of two different sizes, 7 points (3 either side) and 11 points (5 either side): org_c_no_missing-sg7 and org_c_no_missing-sg11. Upon loading them you will notice that the wave bands have been replaced with generic names because the technique has altered the original attributes. Run the benchmark once more.
  1. What is the best technique when Savitzky-Golay preprocessing is used?
# Not run - the preprocessed ARFF files supplied with the activity are used instead.
# (pracma::savgol() could smooth the signal, but map_df() as written would apply the
# filter column-wise, i.e. per waveband, rather than row-wise per spectrum.)
# samp_savgol <- map_df(samp_norm %>% select(-OrganicCarbon), pracma::savgol, fl = 7, forder = 2, dorder = 0)
# samp_savgol <- cbind(samp_savgol, OrganicCarbon = samp_norm$OrganicCarbon)

samp_savgol_7 <- read.arff(file="org_c_no_missing-sg7.arff")
samp_savgol_11 <- read.arff(file="org_c_no_missing-sg11.arff")
        

summaries_savgol_7 <- pmap(.l=lst, .f=weka_summary, dfr=samp_savgol_7)
names(summaries_savgol_7) <-lst[["classifier"]]
summaries_savgol_7
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6436
## Mean absolute error                      1.1741
## Root mean squared error                  2.4534
## Relative absolute error                 91.6766 %
## Root relative squared error             76.5982 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.8455
## Mean absolute error                      0.6531
## Root mean squared error                  1.7293
## Relative absolute error                 50.9957 %
## Root relative squared error             53.9885 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.7649
## Mean absolute error                      1.0013
## Root mean squared error                  2.0974
## Relative absolute error                 78.1868 %
## Root relative squared error             65.4822 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.8533
## Mean absolute error                      0.6248
## Root mean squared error                  1.7402
## Relative absolute error                 48.7829 %
## Root relative squared error             54.3292 %
## Total Number of Instances             3911
summaries_savgol_11 <- pmap(.l=lst, .f=weka_summary, dfr=samp_savgol_11)
names(summaries_savgol_11) <-lst[["classifier"]]


ans_savgol_max <- unlist(map(c( summaries_savgol_7, summaries_savgol_11), function(x)  x$details[[1]]))


ans_savgol_max_df <- data.frame(
        classifier  = names(ans_savgol_max),
        correl.coef = as.vector(ans_savgol_max),
        window_size = paste0("size ", c(rep("7",  length(summaries_savgol_7)),
                                        rep("11", length(summaries_savgol_11)))))

# Output results as table
kable(xtabs(correl.coef ~ classifier + window_size, ans_savgol_max_df), caption = "Correlation coefficients for Savitzky-Golay Window Sizes")
Correlation coefficients for Savitzky-Golay Window Sizes

|classifier                                   | size 11| size 7|
|:--------------------------------------------|-------:|------:|
|weka.classifiers.functions.LinearRegression  |  0.5771| 0.6436|
|weka.classifiers.trees.M5P                   |  0.8262| 0.8455|
|weka.classifiers.trees.RandomForest          |  0.8499| 0.8533|
|weka.classifiers.trees.REPTree               |  0.6936| 0.7649|

Answer: with Savitzky-Golay preprocessing, the best result is achieved by weka.classifiers.trees.RandomForest with the 7-point window (0.8533).
# More comprehensive:

summaries_savgol_11
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.5771
## Mean absolute error                      1.201 
## Root mean squared error                  2.6163
## Relative absolute error                 93.7764 %
## Root relative squared error             81.6833 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.8262
## Mean absolute error                      0.6729
## Root mean squared error                  1.8072
## Relative absolute error                 52.5421 %
## Root relative squared error             56.422  %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6936
## Mean absolute error                      0.9145
## Root mean squared error                  2.3122
## Relative absolute error                 71.4095 %
## Root relative squared error             72.1871 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.8499
## Mean absolute error                      0.6315
## Root mean squared error                  1.7573
## Relative absolute error                 49.3077 %
## Root relative squared error             54.8645 %
## Total Number of Instances             3911
  1. One of these window sizes is better across all classifiers. Which one?

Answer: the 7-point window; it gives the higher correlation coefficient for every classifier (see the table above).

One example pair, extracted from the two sets of summaries:

(ans_savgol_diffs <- unlist(map2(summaries_savgol_11, summaries_savgol_7,
        function(x, y) list(savgol7 = y$details[[1]], savgol11 = x$details[[1]]))))[1:2]
##  weka.classifiers.functions.LinearRegression.savgol7 
##                                               0.6436 
## weka.classifiers.functions.LinearRegression.savgol11 
##                                               0.5771

We have seen that preprocessing can make a big difference to the performance of a classifier. So far, three different techniques have been applied independently. What if we combine them? We downsampled the original file by removing every second attribute, then applied Savitzky-Golay, then row normalization, to produce org_c_no_missing-d2sg7rn. Load this dataset and re-run the benchmark.

  1. For one of the classifiers, this combination produces the best result of all preprocessing techniques. Which one?
samp_savgol_7_norm <- read.arff(file="org_c_no_missing-d2sg7rn.arff")

summaries_savgol_7_norm <- pmap(.l=lst, .f=weka_summary, dfr=samp_savgol_7_norm)
names(summaries_savgol_7_norm) <-lst[["classifier"]]
summaries_savgol_7_norm
## $weka.classifiers.functions.LinearRegression
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.6734
## Mean absolute error                      1.0782
## Root mean squared error                  2.3675
## Relative absolute error                 84.1895 %
## Root relative squared error             73.9135 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.M5P
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.8216
## Mean absolute error                      0.6938
## Root mean squared error                  1.8264
## Relative absolute error                 54.1761 %
## Root relative squared error             57.02   %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.REPTree
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.4821
## Mean absolute error                      0.976 
## Root mean squared error                  2.8614
## Relative absolute error                 76.2103 %
## Root relative squared error             89.3357 %
## Total Number of Instances             3911     
## 
## $weka.classifiers.trees.RandomForest
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correlation coefficient                  0.8336
## Mean absolute error                      0.6739
## Root mean squared error                  1.8818
## Relative absolute error                 52.623  %
## Root relative squared error             58.7518 %
## Total Number of Instances             3911

Answer: weka.classifiers.functions.LinearRegression. The combined preprocessing gives it a correlation coefficient of 0.6734, its best result across all of the preprocessing variants.

Final remarks

Note that we have not examined the effects of parameter changes in either the classifiers or the preprocessing techniques (except the Savitzky-Golay window size).
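
As an example of what such experimentation could look like (a sketch only, not run for this report), RWeka passes classifier options through Weka_control(); Weka 3.9's RandomForest takes -I for the number of trees (default 100):

# Sketch: a RandomForest with more trees, evaluated on the Savitzky-Golay data.
RF <- make_Weka_classifier("weka.classifiers.trees.RandomForest")
fit_rf <- RF(OrganicCarbon ~ ., data = samp_savgol_7,
             control = Weka_control(I = 200))
evaluate_Weka_classifier(fit_rf, numFolds = 10, seed = 1, class = TRUE)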

You can perform much more experimentation in search of a good model! One problem faced in all application development is knowing when a result is good enough to be used in practice.

In our experience, the correlation coefficient needs to increase to 0.95-0.99 for this problem.

Our best result in this activity is a correlation coefficient of 0.8533 (RandomForest on the 7-point Savitzky-Golay data), still a long way off. Another important factor that we have not explored is the effect of outliers in regression problems. Filtering out outlier instances can make a huge difference to performance.
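
As a final sketch (again not part of the activity), one simple way to drop extreme OrganicCarbon values before re-running the benchmark is an interquartile-range rule on the class attribute; the cut-off factor of 3 is an arbitrary choice:

# Sketch: remove instances whose OrganicCarbon lies far outside the interquartile range.
q   <- quantile(samp$OrganicCarbon, c(0.25, 0.75))
iqr <- q[2] - q[1]
samp_trimmed <- samp %>%
        filter(OrganicCarbon >= q[1] - 3 * iqr,
               OrganicCarbon <= q[2] + 3 * iqr)
# summaries_trimmed <- pmap(.l = lst, .f = weka_summary, dfr = samp_trimmed)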

System / R Setup

Note that the RWeka package used here relies on RWekajars 3.9.0 (i.e., Weka 3.9.0).

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
##  [3] LC_TIME=de_DE.UTF-8           LC_COLLATE=en_US.UTF-8       
##  [5] LC_MONETARY=de_DE.UTF-8       LC_MESSAGES=en_US.UTF-8      
##  [7] LC_PAPER=de_DE.UTF-8          LC_NAME=de_DE.UTF-8          
##  [9] LC_ADDRESS=de_DE.UTF-8        LC_TELEPHONE=de_DE.UTF-8     
## [11] LC_MEASUREMENT=de_DE.UTF-8    LC_IDENTIFICATION=de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] caret_6.0-70    ggplot2_2.1.0   lattice_0.20-33 knitr_1.13     
##  [5] purrr_0.2.2     dplyr_0.5.0     tidyr_0.5.1     RWeka_0.4-29   
##  [9] foreign_0.8-66  colorout_1.1-2 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6        highr_0.6          nloptr_1.0.4      
##  [4] formatR_1.4        plyr_1.8.4         iterators_1.0.8   
##  [7] tools_3.2.2        RWekajars_3.9.0-1  lme4_1.1-12       
## [10] digest_0.6.9       evaluate_0.9       tibble_1.1        
## [13] gtable_0.2.0       nlme_3.1-128       mgcv_1.8-12       
## [16] Matrix_1.2-6       foreach_1.4.3      DBI_0.4-1         
## [19] parallel_3.2.2     yaml_2.1.13        SparseM_1.7       
## [22] rJava_0.9-8        stringr_1.0.0      MatrixModels_0.4-1
## [25] stats4_3.2.2       grid_3.2.2         nnet_7.3-12       
## [28] R6_2.1.2           rmarkdown_1.0      minqa_1.2.4       
## [31] reshape2_1.4.1     car_2.1-2          magrittr_1.5      
## [34] splines_3.2.2      scales_0.4.0       codetools_0.2-14  
## [37] htmltools_0.3.5    MASS_7.3-45        assertthat_0.1    
## [40] pbkrtest_0.4-4     colorspace_1.2-6   quantreg_5.26     
## [43] stringi_1.1.1      lazyeval_0.2.0     munsell_0.4.3