While searching for data to fulfill the requirements of the Google Data Analytics Capstone: "Complete a Case Study", to be published on RPubs, I found this dataset at: Riese and Keller. It seemed interesting to me (I am a chemist), so I decided to use it.
The following text is taken from the Markdown file written by the authors:
----------- (Beginning of the quotation) -----------
Hyperspectral and soil-moisture data from a field campaign based on a soil sample. Karlsruhe (Germany), 2017.
Introducing paper: Felix M. Riese and Sina Keller, Introducing a Framework of Self-Organizing Maps for Regression of Soil Moisture with Hyperspectral Data, in 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Valencia, Spain, 2018, accepted.
License: GNU GPLv2
Authors:
Citation of the dataset: TODO
This dataset was measured in a five-day field campaign in May 2017 in Karlsruhe, Germany. An undisturbed soil sample is the centerpiece of the measurement setup. The soil sample consists of bare soil without any vegetation and was taken in the area near Waldbronn, Germany.
The following sensors were deployed:
----------- (End of the quotation) -----------
Inspecting the table suggested that the intensities lie uniformly in the range 0.12-0.15 for all wavelengths. However, when we checked the sum of all intensities for each wavelength, we discovered one point at 825 nm that is far out of range (125), while all the other values fall within the range described above. We had two options: delete the entire observation (row) or correct the value manually, since it looks like an error introduced while handling the data (zipping and unzipping, downloading, etc.). We chose to correct the value to 0.125, in line with the neighbouring values (above and below). After that we prepared the graphical output.
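The check described above can be sketched in a few lines of base R. This is only an illustration on a small synthetic table: the column names, the table size, and the median-based replacement are assumptions made for the example, whereas the value in the real data was corrected by hand to 0.125.

```r
# Synthetic stand-in for the spectral table: 5 spectra x 4 bands
# (hypothetical column names and values, not the real dataset)
set.seed(1)
spectra <- as.data.frame(matrix(runif(20, 0.12, 0.15), nrow = 5))
names(spectra) <- c("X815", "X820", "X825", "X830")
spectra[3, "X825"] <- 125   # simulate the corrupted reading

# Column sums make the band that contains the anomaly stand out
col_sums <- colSums(spectra)
suspect_band <- names(col_sums)[which.max(col_sums)]

# Locate every cell outside the expected 0.12-0.15 range
bad <- which(spectra < 0.12 | spectra > 0.15, arr.ind = TRUE)

# Assumption for this sketch: replace each flagged cell with the median
# of the remaining values in the same band
for (i in seq_len(nrow(bad))) {
  r <- bad[i, "row"]
  cc <- bad[i, "col"]
  spectra[r, cc] <- median(spectra[-r, cc])
}
```

Flagging cells by an explicit range, rather than by eye, also makes it easy to confirm that only a single value was affected before deciding between deleting the row and correcting it.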
First I set up a hook to save rgl plots (no hook is needed
with rgl version >= 0.96.0; just call
rglwidget()):
Now the moisture levels are plotted:
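library(rgl)
# set the viewing angle (view3d() replaces the deprecated rgl.viewpoint())
view3d(theta = 25, phi = 25)
setwd("D://Documents//Certificado-Google//Data Analytics//Portfolio//CaseStudy-I")
# x_x: wavelengths (lambda), z_z: intensities, y_y: moisture (column 1) and temperature (column 2)
x_x <- read.csv("soilmoisture_xData.csv")
z_z <- read.csv("soilmoisture_zData.csv")
y_y <- read.csv("soilmoisture_yData.csv")
#my_color <- ifelse(A<=8.5 & A>8,"red", ifelse(A<=7.5 & A>7,"blue","green"))
# 3D scatter plot of intensity vs. moisture and wavelength, coloured by moisture level
plot3d(y_y[,1], unlist(z_z), x_x[,1], type = 'p', xlab = "moist", zlab = "lambda", ylab = "I", bty = "b2", col = y_y[,1], screen = list(x = 45), box = FALSE)
rglwidget()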
As can be seen, the moisture levels are shown as small points in different colors. Plotted this way, the data are less confusing, because some of the bands are strongly affected by the moisture level, due to the different intermolecular forces at different water contents. The data show no outliers. Now we can start the data analysis. For this purpose we first use PLS regression, as implemented in the R package pls.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pls.options(plsralg = "simpls")
str_ <-"__________________________ PLSR results here __________________________________"
model <- plsr( y_y[,1] ~ ., data = z_z, validation = "LOO")
summary(model)
## Data: X dimension: 679 125
## Y dimension: 679 1
## Fit method: simpls
## Number of components considered: 125
##
## VALIDATION: RMSEP
## Cross-validated using 679 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3.648 2.412 2.018 1.903 1.708 1.687 1.541
## adjCV 3.648 2.412 2.018 1.903 1.708 1.687 1.541
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1.392 1.331 1.292 1.278 1.283 1.276 1.284
## adjCV 1.392 1.331 1.292 1.278 1.283 1.276 1.284
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 1.296 1.306 1.319 1.326 1.33 1.336 1.342
## adjCV 1.296 1.306 1.319 1.326 1.33 1.335 1.342
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 1.342 1.341 1.343 1.343 1.342 1.344 1.344
## adjCV 1.342 1.341 1.343 1.343 1.342 1.344 1.344
## 28 comps 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps
## CV 1.344 1.344 1.344 1.344 1.344 1.344 1.344
## adjCV 1.344 1.344 1.344 1.344 1.343 1.344 1.344
## 35 comps 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps
## CV 1.344 1.344 1.344 1.344 1.344 1.344 1.344
## adjCV 1.344 1.344 1.344 1.344 1.344 1.344 1.344
## 42 comps 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps
## CV 1.344 1.344 1.344 1.344 1.344 1.344 1.344
## adjCV 1.344 1.344 1.344 1.344 1.344 1.344 1.344
## 49 comps 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 1.344 1.344 1.344 1.344 1.344 1.344 1.344
## adjCV 1.344 1.344 1.344 1.344 1.344 1.343 1.344
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps
## CV 1.344 1.346 1.348 1.358 1.389 1.446 1.581
## adjCV 1.344 1.345 1.346 1.353 1.372 1.401 1.460
## 63 comps 64 comps 65 comps 66 comps 67 comps 68 comps 69 comps
## CV 1.924 2.469 3.426 4.937 6.794 8.897 11.166
## adjCV 1.617 1.759 1.999 2.508 3.285 4.537 6.374
## 70 comps 71 comps 72 comps 73 comps 74 comps 75 comps 76 comps
## CV 13.531 16.01 18.56 21.17 23.79 26.44 29.11
## adjCV 8.603 11.10 13.72 16.40 19.10 21.82 24.55
## 77 comps 78 comps 79 comps 80 comps 81 comps 82 comps 83 comps
## CV 31.79 34.47 37.16 39.86 42.56 45.27 47.98
## adjCV 27.28 30.02 32.76 35.50 38.25 40.99 43.74
## 84 comps 85 comps 86 comps 87 comps 88 comps 89 comps 90 comps
## CV 50.69 53.40 56.11 58.83 61.55 64.27 66.99
## adjCV 46.49 49.24 51.99 54.74 57.50 60.25 63.00
## 91 comps 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps
## CV 69.71 72.43 75.15 77.87 80.60 83.32 86.05
## adjCV 65.76 68.51 71.26 74.02 76.77 79.53 82.28
## 98 comps 99 comps 100 comps 101 comps 102 comps 103 comps
## CV 88.77 91.5 94.22 96.95 99.68 102.40
## adjCV 85.04 87.8 90.55 93.31 96.06 98.82
## 104 comps 105 comps 106 comps 107 comps 108 comps 109 comps
## CV 105.1 107.9 110.6 113.3 116.0 118.8
## adjCV 101.6 104.3 107.1 109.8 112.6 115.4
## 110 comps 111 comps 112 comps 113 comps 114 comps 115 comps
## CV 121.5 124.2 127.0 129.7 132.4 135.1
## adjCV 118.1 120.9 123.6 126.4 129.1 131.9
## 116 comps 117 comps 118 comps 119 comps 120 comps 121 comps
## CV 137.9 140.6 143.3 146.1 148.8 151.5
## adjCV 134.7 137.4 140.2 142.9 145.7 148.4
## 122 comps 123 comps 124 comps 125 comps
## CV 154.2 157 159.7 162.4
## adjCV 151.2 154 156.7 159.5
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 98.89 99.50 99.76 99.79 99.95 99.96 99.97
## y_y[, 1] 56.38 69.68 73.30 78.83 79.48 83.09 86.53
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 99.97 99.98 99.98 99.98 99.98 99.98 99.98
## y_y[, 1] 88.24 89.21 89.86 90.16 90.45 90.68 90.89
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
## X 99.99 99.99 99.99 99.99 99.99 99.99 99.99
## y_y[, 1] 91.02 91.17 91.24 91.32 91.38 91.43 91.46
## 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps
## X 99.99 99.99 99.99 99.99 99.99 99.99 99.99
## y_y[, 1] 91.48 91.50 91.51 91.52 91.53 91.53 91.53
## 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps
## X 99.99 99.99 99.99 99.99 99.99 99.99 99.99
## y_y[, 1] 91.53 91.53 91.53 91.53 91.53 91.53 91.53
## 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps
## X 99.99 99.99 99.99 99.99 99.99 100.00 100.00
## y_y[, 1] 91.53 91.53 91.53 91.53 91.53 91.53 91.53
## 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## y_y[, 1] 91.53 91.53 91.53 91.53 91.53 91.53 91.53
## 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## y_y[, 1] 91.53 91.53 91.53 91.53 91.53 91.53 91.53
## 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps 63 comps
## X 100.00 100.00 100.00 100.02 100.07 100.21 100.63
## y_y[, 1] 91.53 91.53 91.53 91.52 91.49 91.41 91.16
## 64 comps 65 comps 66 comps 67 comps 68 comps 69 comps 70 comps
## X 101.79 105.18 118.25 148.77 206.74 287.7 380.0
## y_y[, 1] 90.48 88.45 80.09 55.88 -10.84 -159.0 -415.4
## 71 comps 72 comps 73 comps 74 comps 75 comps 76 comps 77 comps
## X 476.4 574.5 673.1 771.9 870.7 969.5 1068
## y_y[, 1] -787.6 -1277.0 -1882.6 -2603.1 -3438.5 -4388.4 -5453
## 78 comps 79 comps 80 comps 81 comps 82 comps 83 comps 84 comps
## X 1167 1266 1365 1464 1563 1661 1760
## y_y[, 1] -6632 -7925 -9333 -10856 -12493 -14244 -16110
## 85 comps 86 comps 87 comps 88 comps 89 comps 90 comps 91 comps
## X 1859 1958 2057 2156 2254 2353 2452
## y_y[, 1] -18091 -20185 -22395 -24719 -27157 -29710 -32378
## 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps 98 comps
## X 2551 2650 2749 2848 2946 3045 3144
## y_y[, 1] -35160 -38056 -41067 -44192 -47432 -50787 -54256
## 99 comps 100 comps 101 comps 102 comps 103 comps 104 comps
## X 3243 3342 3441 3539 3638 3737
## y_y[, 1] -57839 -61537 -65349 -69276 -73318 -77473
## 105 comps 106 comps 107 comps 108 comps 109 comps 110 comps
## X 3836 3935 4034 4132 4231 4330
## y_y[, 1] -81744 -86129 -90628 -95242 -99970 -104813
## 111 comps 112 comps 113 comps 114 comps 115 comps 116 comps
## X 4429 4528 4627 4726 4824 4923
## y_y[, 1] -109770 -114842 -120029 -125330 -130745 -136275
## 117 comps 118 comps 119 comps 120 comps 121 comps 122 comps
## X 5022 5121 5220 5319 5417 5516
## y_y[, 1] -141919 -147678 -153551 -159539 -165641 -171858
## 123 comps 124 comps 125 comps
## X 5615 5714 5813
## y_y[, 1] -178190 -184635 -191196
validationplot(model,ncomp=10, val.type="MSEP")
validationplot(model, ncomp=10, val.type="R2")
The Root Mean Square Error of Prediction (RMSEP) table shows that the cross-validated error falls quickly up to about 10 components (1.278), reaches its minimum at 12 components (1.276), and then rises slightly to a plateau of about 1.344 that lasts until roughly 56 components. Beyond about 60 components the error grows very quickly and the reported explained variance exceeds 100%, which is not meaningful and suggests numerical breakdown rather than real information. The second part of the table (% variance explained) shows that 5 components already explain almost all of the X variance (99.95%), whereas for the moisture percentage (the Y variable) 90% of the variance is only reached with 11 components (90.16%). For that reason, the validation plots for MSEP and R2 shown above were made with only 10 components, as was the plot of predicted vs. measured values that follows:
plot(model, ncomp = 10, asp = 1, line = TRUE)
plot(model, plottype = "scores", comps = 1:10)
plot(model, "loadings", comps = 1:10, cex=0.5, pch=1, xlab = "nm") #labels = "numbers", legendpos = "bottom",
abline(h = 0)
plot(model, plottype = "coef", ncomp=1:5, legendpos = "bottomleft", xlab = "nm")
#predict(model, ncomp = 5, newdata = as.data.frame (XValidate))
str_<-"__________________________________________________________________________________"
#print(predict)
str_ <-"__________________________________________________________________________________"
plot(model, plottype = "correlation")
str_<-"__________________________________________________________________________________"
In the correlation loadings plot almost all the points fall on or near the 100% circle, so the first two components together explain nearly all of the variance of each band.
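The choice of the number of components for the moisture model can also be made explicit in code. The following base R sketch uses the cross-validated RMSEP values for 0 to 14 components copied from the summary table above; the 1% tolerance is an assumption chosen for illustration, not something taken from the original analysis.

```r
# Cross-validated RMSEP for 0..14 components, copied from the moisture summary
cv_rmsep <- c(3.648, 2.412, 2.018, 1.903, 1.708, 1.687, 1.541, 1.392,
              1.331, 1.292, 1.278, 1.283, 1.276, 1.284, 1.296)
names(cv_rmsep) <- 0:14

# Number of components with the absolute minimum error
n_min <- as.integer(names(which.min(cv_rmsep)))

# Smallest number of components whose error is within a tolerance of the minimum
tol <- 0.01  # assumption: accept models up to 1% worse than the best one
within_tol <- which(cv_rmsep <= min(cv_rmsep) * (1 + tol))
n_parsimonious <- as.integer(names(cv_rmsep)[within_tol[1]])

print(c(minimum = n_min, parsimonious = n_parsimonious))
```

With these numbers the absolute minimum lies at 12 components, but 10 components are already within 1% of it, which agrees with using 10 components in the plots. The pls package also provides selectNcomp() for a more formal choice.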
We will not make predictions in this example (neither for moisture % nor for temperature), but they can be made by adapting the following code:
#define training and testing sets
train <- mtcars[1:25, c("hp", "mpg", "disp", "drat", "wt", "qsec")]
y_test <- mtcars[26:nrow(mtcars), c("hp")]
test <- mtcars[26:nrow(mtcars), c("mpg", "disp", "drat", "wt", "qsec")]
#fit the PLS model with cross-validation and predict with two components
model <- plsr(hp~mpg+disp+drat+wt+qsec, data=train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp=2)
#root mean squared error of the test-set predictions
sqrt(mean((pcr_pred - y_test)^2))
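As a rough idea of how the same pattern might look for spectral data, here is a minimal sketch on synthetic values. Everything in it is an assumption made for illustration: the names `soil`, `spec` and `moist`, the sizes, the 80/20 split and the simulated relationship between bands and moisture. It does not use the real soil-moisture files.

```r
library(pls)

set.seed(42)
n <- 100   # hypothetical number of spectra (the real data have 679)
p <- 20    # hypothetical number of bands (the real data have 125)

# Synthetic spectra in the 0.12-0.15 range and a moisture value driven by two bands
X <- matrix(runif(n * p, 0.12, 0.15), nrow = n)
moist <- 25 + 200 * (X[, 3] - X[, 10]) + rnorm(n, sd = 0.2)
soil <- data.frame(moist = moist, spec = I(X))

# 80/20 split into training and test sets
train_idx <- sample(n, size = 0.8 * n)
train <- soil[train_idx, ]
test <- soil[-train_idx, ]

# Fit on the training set, then predict the held-out spectra with 10 components
fit <- plsr(moist ~ spec, ncomp = 10, data = train, validation = "CV")
pred <- drop(predict(fit, newdata = test, ncomp = 10))

# Root mean squared error on the test set
rmse <- sqrt(mean((pred - test$moist)^2))
print(rmse)
```

Keeping the spectra as a single matrix column (wrapped in `I()`) avoids writing out one formula term per band and keeps rows aligned when the data frame is split.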
Now comes the same study for temperature. The same analysis of the resulting plots applies here, bearing in mind that about 8 to 9 components are enough in this case (the cross-validated RMSEP reaches its minimum, 1.868, at 9 components), whereas moisture needed at least 10.
Again, I ran a PLSR calculation to find the number of relevant components:
## Data: X dimension: 679 125
## Y dimension: 679 1
## Fit method: simpls
## Number of components considered: 125
##
## VALIDATION: RMSEP
## Cross-validated using 679 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4.664 3.666 3.18 2.92 2.394 2.254 2.139
## adjCV 4.664 3.666 3.18 2.92 2.394 2.254 2.139
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.001 1.928 1.868 1.872 1.882 1.897 1.895
## adjCV 2.000 1.928 1.868 1.872 1.881 1.897 1.894
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 1.926 1.939 1.937 1.962 1.974 1.991 1.996
## adjCV 1.926 1.939 1.937 1.962 1.974 1.991 1.996
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 2 1.999 2.000 2.002 2.009 2.009 2.009
## adjCV 2 1.998 1.999 2.002 2.008 2.009 2.008
## 28 comps 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps
## CV 2.01 2.01 2.01 2.011 2.011 2.011 2.011
## adjCV 2.01 2.01 2.01 2.011 2.011 2.011 2.011
## 35 comps 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps
## CV 2.011 2.011 2.011 2.011 2.011 2.011 2.011
## adjCV 2.011 2.011 2.011 2.011 2.011 2.011 2.011
## 42 comps 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps
## CV 2.011 2.011 2.011 2.011 2.011 2.011 2.011
## adjCV 2.011 2.011 2.011 2.011 2.011 2.011 2.011
## 49 comps 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 2.011 2.011 2.011 2.011 2.011 2.01 2.012
## adjCV 2.011 2.011 2.011 2.011 2.010 2.01 2.011
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps
## CV 2.013 2.015 2.018 2.024 2.038 2.103 2.250
## adjCV 2.012 2.014 2.017 2.021 2.026 2.068 2.155
## 63 comps 64 comps 65 comps 66 comps 67 comps 68 comps 69 comps
## CV 2.472 3.029 4.339 5.872 7.847 10.089 12.527
## adjCV 2.198 2.328 2.889 3.282 3.964 5.028 6.703
## 70 comps 71 comps 72 comps 73 comps 74 comps 75 comps 76 comps
## CV 15.096 17.74 20.45 23.21 25.99 28.80 31.62
## adjCV 8.876 11.35 14.01 16.75 19.54 22.36 25.19
## 77 comps 78 comps 79 comps 80 comps 81 comps 82 comps 83 comps
## CV 34.46 37.30 40.15 43.00 45.86 48.72 51.59
## adjCV 28.04 30.89 33.74 36.61 39.47 42.34 45.21
## 84 comps 85 comps 86 comps 87 comps 88 comps 89 comps 90 comps
## CV 54.46 57.33 60.20 63.07 65.95 68.82 71.70
## adjCV 48.08 50.95 53.82 56.70 59.57 62.45 65.33
## 91 comps 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps
## CV 74.58 77.46 80.34 83.22 86.10 88.98 91.86
## adjCV 68.21 71.09 73.96 76.84 79.72 82.60 85.48
## 98 comps 99 comps 100 comps 101 comps 102 comps 103 comps
## CV 94.74 97.62 100.51 103.39 106.28 109.2
## adjCV 88.36 91.25 94.13 97.01 99.89 102.8
## 104 comps 105 comps 106 comps 107 comps 108 comps 109 comps
## CV 112.0 114.9 117.8 120.7 123.6 126.5
## adjCV 105.7 108.5 111.4 114.3 117.2 120.1
## 110 comps 111 comps 112 comps 113 comps 114 comps 115 comps
## CV 129.4 132.2 135.1 138.0 140.9 143.8
## adjCV 122.9 125.8 128.7 131.6 134.5 137.4
## 116 comps 117 comps 118 comps 119 comps 120 comps 121 comps
## CV 146.7 149.6 152.4 155.3 158.2 161.1
## adjCV 140.2 143.1 146.0 148.9 151.8 154.7
## 122 comps 123 comps 124 comps 125 comps
## CV 164.0 166.9 169.8 172.7
## adjCV 157.5 160.4 163.3 166.2
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 98.89 99.39 99.75 99.82 99.95 99.96 99.97
## y_y[, 2] 38.36 54.29 61.97 74.80 77.62 80.78 84.08
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 99.97 99.98 99.98 99.98 99.98 99.98 99.98
## y_y[, 2] 85.69 86.77 87.40 87.74 88.25 88.50 88.67
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
## X 99.99 99.99 99.99 99.99 99.99 99.99 99.99
## y_y[, 2] 88.77 88.88 89.00 89.06 89.11 89.15 89.19
## 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps
## X 99.99 99.99 99.99 99.99 99.99 99.99 99.99
## y_y[, 2] 89.21 89.23 89.24 89.24 89.25 89.25 89.26
## 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps
## X 99.99 99.99 99.99 99.99 99.99 99.99 99.99
## y_y[, 2] 89.26 89.26 89.26 89.26 89.26 89.26 89.26
## 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps
## X 99.99 99.99 99.99 99.99 99.99 100.00 100.00
## y_y[, 2] 89.26 89.26 89.26 89.26 89.26 89.26 89.26
## 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## y_y[, 2] 89.26 89.26 89.26 89.26 89.26 89.26 89.26
## 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## y_y[, 2] 89.26 89.26 89.26 89.26 89.26 89.26 89.26
## 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps 63 comps
## X 100.00 100.00 100.00 100.01 100.04 100.1 100.42
## y_y[, 2] 89.26 89.26 89.26 89.25 89.24 89.2 89.09
## 64 comps 65 comps 66 comps 67 comps 68 comps 69 comps 70 comps
## X 101.33 105.00 113.43 136.85 186.1 261.60 349.1
## y_y[, 2] 88.74 87.29 83.72 72.27 39.2 -40.34 -183.9
## 71 comps 72 comps 73 comps 74 comps 75 comps 76 comps 77 comps
## X 443.5 541.3 639.9 738.7 837.6 936.5 1035
## y_y[, 2] -403.1 -701.7 -1078.1 -1530.9 -2059.9 -2664.9 -3346
## 78 comps 79 comps 80 comps 81 comps 82 comps 83 comps 84 comps
## X 1134 1233 1332 1431 1530 1629 1728
## y_y[, 2] -4103 -4936 -5845 -6831 -7892 -9029 -10242
## 85 comps 86 comps 87 comps 88 comps 89 comps 90 comps 91 comps
## X 1826 1925 2024 2123 2222 2321 2420
## y_y[, 2] -11532 -12897 -14338 -15856 -17449 -19118 -20864
## 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps 98 comps
## X 2519 2618 2716 2815 2914 3013 3112
## y_y[, 2] -22685 -24583 -26556 -28606 -30731 -32933 -35210
## 99 comps 100 comps 101 comps 102 comps 103 comps 104 comps
## X 3211 3310 3409 3507 3606 3705
## y_y[, 2] -37564 -39993 -42499 -45081 -47738 -50472
## 105 comps 106 comps 107 comps 108 comps 109 comps 110 comps
## X 3804 3903 4002 4101 4200 4299
## y_y[, 2] -53282 -56168 -59129 -62167 -65281 -68471
## 111 comps 112 comps 113 comps 114 comps 115 comps 116 comps
## X 4397 4496 4595 4694 4793 4892
## y_y[, 2] -71737 -75078 -78496 -81990 -85560 -89206
## 117 comps 118 comps 119 comps 120 comps 121 comps 122 comps
## X 4991 5090 5189 5287 5386 5485
## y_y[, 2] -92928 -96726 -100600 -104550 -108576 -112679
## 123 comps 124 comps 125 comps
## X 5584 5683 5782
## y_y[, 2] -116857 -121111 -125441
# Conclusions
The dataset used here demonstrates a good correlation between the intensities at the measured wavelengths and both the moisture level and the temperature, at least for this region of Germany.