Welcome to the second script of Module 5! Here, we will use Random Forest models to predict historical crop yields with climate and weather variables aggregated for different crop growth phases. You will learn how to:
Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!
As usual, we start by loading the packages we need, and by defining our working directory:
library(randomForest)
library(ggplot2)
library(corrplot)
library(pdp)
library(viridis)
dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_5_Random_Forests/M05_Part02_Climate/"
Import the data for winter wheat. It contains one observation for each year and marz. The variable “yield” is the response variable and is based on the official statistics reported by ARMSTAT.
There are 11 predictors for each growth phase. In this case, winter wheat has five phases:
(see Table 3 in the main report of WP3).
And there are the following 11 climate and weather parameters that were calculated from daily CHIRPS precipitation and ERA5-temperature data that was aggregated for the respective phase and marz:
Factor levels are not saved in CSV files. Therefore, we have to convert “region” after importing the data. Let us also exclude all incomplete observations, i.e. observations that are “NA” for at least one predictor variable:
data <- read.csv(paste0(dir, "data/historical_climate/winterwheat_historical.csv"))
head(data)
| region | year | yield | PRCP_phaseA | TAVG_phaseA | TMIN_phaseA | TMAX_phaseA | DH_phaseA | DHW_phaseA | NH_phaseA | NHW_phaseA | HP_phaseA | FR_phaseA | GDD_phaseA | PRCP_phaseB | TAVG_phaseB | TMIN_phaseB | TMAX_phaseB | DH_phaseB | DHW_phaseB | NH_phaseB | NHW_phaseB | HP_phaseB | FR_phaseB | GDD_phaseB | PRCP_phaseC | TAVG_phaseC | TMIN_phaseC | TMAX_phaseC | DH_phaseC | DHW_phaseC | NH_phaseC | NHW_phaseC | HP_phaseC | FR_phaseC | GDD_phaseC | PRCP_phaseD | TAVG_phaseD | TMIN_phaseD | TMAX_phaseD | DH_phaseD | DHW_phaseD | NH_phaseD | NHW_phaseD | HP_phaseD | FR_phaseD | GDD_phaseD | PRCP_phaseE | TAVG_phaseE | TMIN_phaseE | TMAX_phaseE | DH_phaseE | DHW_phaseE | NH_phaseE | NHW_phaseE | HP_phaseE | FR_phaseE | GDD_phaseE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aragatsotn | 2005 | 21.0 | 43.24110 | 8.801186 | 3.662232 | 14.40586 | 0 | 0 | 0 | 0 | 0.000000 | -22.58002 | 589.6795 | 190.4587 | -5.592452 | -10.014605 | -0.8960821 | 0 | 0 | 0 | 0 | 0.0000000 | -1443.4364 | 0 | 107.9768 | 8.102834 | 3.011731 | 13.10697 | 0.000000 | 0.000000 | 0 | 0 | 0.0000000 | -5.785085 | 324.1134 | 121.57942 | 14.10018 | 8.130387 | 19.71829 | 0 | 0 | 0 | 0 | 6.556601 | 0 | 719.1092 | 54.70011 | 19.22392 | 13.50051 | 24.97799 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 884.3002 |
| Aragatsotn | 2006 | 14.2 | 103.51512 | 8.433505 | 3.642841 | 13.86310 | 0 | 0 | 0 | 0 | 12.640926 | -24.17632 | 463.8428 | 181.4930 | -2.939045 | -6.931621 | 1.5802151 | 0 | 0 | 0 | 0 | 0.0000000 | -1048.4495 | 0 | 210.0441 | 9.784795 | 4.072924 | 15.21274 | 12.331413 | 12.331413 | 0 | 0 | 6.3990479 | -7.323258 | 474.9240 | 29.14831 | 17.96123 | 11.053016 | 24.35719 | 0 | 0 | 0 | 0 | 0.897646 | 0 | 431.0694 | 69.14790 | 18.89822 | 13.07337 | 24.69152 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 907.1144 |
| Aragatsotn | 2007 | 25.8 | 170.12327 | 12.675672 | 7.566744 | 18.25980 | 0 | 0 | 0 | 0 | 7.777229 | 0.00000 | 1128.1348 | 208.6596 | -3.551540 | -7.888055 | 1.3182586 | 0 | 0 | 0 | 0 | 0.5757751 | -1357.6673 | 0 | 143.9691 | 11.289824 | 5.381687 | 16.98843 | 10.096273 | 7.564598 | 0 | 0 | 7.1704102 | -3.143651 | 365.3237 | 28.97024 | 15.69509 | 9.933802 | 21.31315 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 470.8527 | 117.21213 | 18.01944 | 12.28245 | 23.94591 | 0 | 0 | 0 | 0 | 4.556953 | 0 | 1243.3416 |
| Aragatsotn | 2008 | 22.4 | 125.70627 | 8.952586 | 4.092550 | 14.32815 | 0 | 0 | 0 | 0 | 0.000000 | -22.30723 | 511.8206 | 143.8115 | -6.161019 | -10.536147 | -1.3126604 | 0 | 0 | 0 | 0 | 12.7685776 | -1381.1310 | 0 | 190.3477 | 9.066595 | 3.042949 | 14.80560 | 4.863117 | 2.390143 | 0 | 0 | 0.0000000 | -31.244697 | 682.0981 | 90.31423 | 15.48589 | 9.610766 | 21.01814 | 0 | 0 | 0 | 0 | 32.556789 | 0 | 464.5767 | 30.53244 | 19.24053 | 13.47968 | 25.28650 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 731.1402 |
| Aragatsotn | 2009 | 22.9 | 76.48413 | 9.026957 | 4.231709 | 14.40760 | 0 | 0 | 0 | 0 | 4.741770 | -28.23382 | 615.1244 | 157.4761 | -3.539272 | -7.738430 | 1.0963298 | 0 | 0 | 0 | 0 | 0.0000000 | -1060.2846 | 0 | 187.0446 | 6.575619 | 1.004866 | 11.95773 | 2.232709 | 0.000000 | 0 | 0 | 6.9939194 | -72.142856 | 418.2866 | 223.76986 | 15.43371 | 9.784316 | 20.83972 | 0 | 0 | 0 | 0 | 73.703217 | 0 | 586.4812 | 67.01230 | 16.35030 | 10.66225 | 22.11904 | 0 | 0 | 0 | 0 | 5.755676 | 0 | 833.8652 |
| Aragatsotn | 2010 | 20.0 | 142.17263 | 7.429945 | 2.704584 | 12.71059 | 0 | 0 | 0 | 0 | 4.878027 | -16.40441 | 423.5069 | 331.1603 | -1.032124 | -4.936632 | 3.0436822 | 0 | 0 | 0 | 0 | 8.0066032 | -748.4335 | 0 | 168.8895 | 8.651260 | 3.459903 | 13.61745 | 4.664848 | 0.000000 | 0 | 0 | 0.1802406 | -6.164672 | 397.3530 | 89.16552 | 16.70175 | 10.423801 | 22.52617 | 0 | 0 | 0 | 0 | 18.590811 | 0 | 467.6491 | 74.09034 | 19.42425 | 13.25001 | 25.29352 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 738.1215 |
## convert "region" to a factor variable
data$region <- as.factor(data$region)
## delete observations with NAs
data <- data[complete.cases(data),]
We are interested in the 55 climate/weather predictors, and in “region” as another predictor to account for marz-level differences (e.g. in mechanization, policies, etc.). In a regression model, you would probably define “region” as a random effect. We are not interested in “year” and assume that yearly observations are independent from one another. There are predictors that are zero for all observations, for example there is no frost in summer during phases D and E. Let us identify discard all such variables, plus the column “year”:
## identify all variables that are all zero
delete <- NULL
for (j in 1:length(data))
{ if(length(unique(data[,j]))==1) {
if(unique(data[,j])==0) { delete <- c(delete,j) } } }
delete_vars <- names(data)[delete]
delete_vars
## [1] "DH_phaseA" "DHW_phaseA" "NH_phaseA" "NHW_phaseA" "DH_phaseB"
## [6] "DHW_phaseB" "NH_phaseB" "NHW_phaseB" "GDD_phaseB" "NH_phaseC"
## [11] "NHW_phaseC" "DH_phaseD" "DHW_phaseD" "NH_phaseD" "NHW_phaseD"
## [16] "FR_phaseE"
## delete variables
data <- data[, which(names(data)%in% c(delete_vars, "year")==F)]
We have just reduced our dataset to 40 predictors. As all variables are derived from precipitation and temperature data, we can expect that some predictors are highly correlated. Random Forests can handle collinearity quite well, but we should still try to keep it low and reduce those variables that have the highest effects. Let’s calculate the Pearson correlation coefficient for all combinations of numerical predictors with cor(), and plot it using corrplot():
dim(data)
## [1] 160 41
corcoeffs <- cor(data[c(3:41)], method = "pearson")
corrplot(corcoeffs)
We see that there is high correlation for certain predictor combinations. We can reduce this by calculating the variance inflation factor (VIF) based on a linear regression model. To do so, we render the script “VIF.Rmd” which contains the definition of the function vif_func(), which we then apply to our dataset. This function iteratively removes one variable until all variables have a VIF below 10, which is considered a common threshold for acceptable predictor collinearity, and returns the names of those variables that can remain in the model.
We reduce our dataset to only include the variables contained in “final_vars” plus “yield” and “region”, and plot the correlation matrix one more time:
data <- data[, c(1, 2, which(names(data) %in% final_vars)),]
corrplot(cor(data[c(3:28)], method = "pearson"))
Our data set is now pre-processed and ready for the model. Let’s inspect it one more time:
dim(data)
## [1] 160 28
head(data)
| region | yield | PRCP_phaseA | HP_phaseA | FR_phaseA | GDD_phaseA | PRCP_phaseB | TMAX_phaseB | HP_phaseB | FR_phaseB | PRCP_phaseC | DHW_phaseC | HP_phaseC | FR_phaseC | GDD_phaseC | PRCP_phaseD | TMAX_phaseD | HP_phaseD | FR_phaseD | GDD_phaseD | PRCP_phaseE | TMAX_phaseE | DH_phaseE | DHW_phaseE | NH_phaseE | NHW_phaseE | HP_phaseE | GDD_phaseE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aragatsotn | 21.0 | 43.24110 | 0.000000 | -22.58002 | 589.6795 | 190.4587 | -0.8960821 | 0.0000000 | -1443.4364 | 107.9768 | 0.000000 | 0.0000000 | -5.785085 | 324.1134 | 121.57942 | 19.71829 | 6.556601 | 0 | 719.1092 | 54.70011 | 24.97799 | 0 | 0 | 0 | 0 | 0.000000 | 884.3002 |
| Aragatsotn | 14.2 | 103.51512 | 12.640926 | -24.17632 | 463.8428 | 181.4930 | 1.5802151 | 0.0000000 | -1048.4495 | 210.0441 | 12.331413 | 6.3990479 | -7.323258 | 474.9240 | 29.14831 | 24.35719 | 0.897646 | 0 | 431.0694 | 69.14790 | 24.69152 | 0 | 0 | 0 | 0 | 0.000000 | 907.1144 |
| Aragatsotn | 25.8 | 170.12327 | 7.777229 | 0.00000 | 1128.1348 | 208.6596 | 1.3182586 | 0.5757751 | -1357.6673 | 143.9691 | 7.564598 | 7.1704102 | -3.143651 | 365.3237 | 28.97024 | 21.31315 | 0.000000 | 0 | 470.8527 | 117.21213 | 23.94591 | 0 | 0 | 0 | 0 | 4.556953 | 1243.3416 |
| Aragatsotn | 22.4 | 125.70627 | 0.000000 | -22.30723 | 511.8206 | 143.8115 | -1.3126604 | 12.7685776 | -1381.1310 | 190.3477 | 2.390143 | 0.0000000 | -31.244697 | 682.0981 | 90.31423 | 21.01814 | 32.556789 | 0 | 464.5767 | 30.53244 | 25.28650 | 0 | 0 | 0 | 0 | 0.000000 | 731.1402 |
| Aragatsotn | 22.9 | 76.48413 | 4.741770 | -28.23382 | 615.1244 | 157.4761 | 1.0963298 | 0.0000000 | -1060.2846 | 187.0446 | 0.000000 | 6.9939194 | -72.142856 | 418.2866 | 223.76986 | 20.83972 | 73.703217 | 0 | 586.4812 | 67.01230 | 22.11904 | 0 | 0 | 0 | 0 | 5.755676 | 833.8652 |
| Aragatsotn | 20.0 | 142.17263 | 4.878027 | -16.40441 | 423.5069 | 331.1603 | 3.0436822 | 8.0066032 | -748.4335 | 168.8895 | 0.000000 | 0.1802406 | -6.164672 | 397.3530 | 89.16552 | 22.52617 | 18.590811 | 0 | 467.6491 | 74.09034 | 25.29352 | 0 | 0 | 0 | 0 | 0.000000 | 738.1215 |
str(data)
## 'data.frame': 160 obs. of 28 variables:
## $ region : Factor w/ 10 levels "Aragatsotn","Ararat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ yield : num 21 14.2 25.8 22.4 22.9 20 26.6 25.5 29 31.5 ...
## $ PRCP_phaseA: num 43.2 103.5 170.1 125.7 76.5 ...
## $ HP_phaseA : num 0 12.64 7.78 0 4.74 ...
## $ FR_phaseA : num -22.6 -24.2 0 -22.3 -28.2 ...
## $ GDD_phaseA : num 590 464 1128 512 615 ...
## $ PRCP_phaseB: num 190 181 209 144 157 ...
## $ TMAX_phaseB: num -0.896 1.58 1.318 -1.313 1.096 ...
## $ HP_phaseB : num 0 0 0.576 12.769 0 ...
## $ FR_phaseB : num -1443 -1048 -1358 -1381 -1060 ...
## $ PRCP_phaseC: num 108 210 144 190 187 ...
## $ DHW_phaseC : num 0 12.33 7.56 2.39 0 ...
## $ HP_phaseC : num 0 6.4 7.17 0 6.99 ...
## $ FR_phaseC : num -5.79 -7.32 -3.14 -31.24 -72.14 ...
## $ GDD_phaseC : num 324 475 365 682 418 ...
## $ PRCP_phaseD: num 121.6 29.1 29 90.3 223.8 ...
## $ TMAX_phaseD: num 19.7 24.4 21.3 21 20.8 ...
## $ HP_phaseD : num 6.557 0.898 0 32.557 73.703 ...
## $ FR_phaseD : num 0 0 0 0 0 0 0 0 0 0 ...
## $ GDD_phaseD : num 719 431 471 465 586 ...
## $ PRCP_phaseE: num 54.7 69.1 117.2 30.5 67 ...
## $ TMAX_phaseE: num 25 24.7 23.9 25.3 22.1 ...
## $ DH_phaseE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DHW_phaseE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NH_phaseE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NHW_phaseE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HP_phaseE : num 0 0 4.56 0 5.76 ...
## $ GDD_phaseE : num 884 907 1243 731 834 ...
Again, we first have to split our data set into training and testing. Because we have a relatively small sample size of just 160, the random sampling might have a considerable effect on the model performance. We will therefore repeat the model “N” times, and each time select different samples that we assign as training and testing data set. We then perform the model with 300 trees and the default value for mtry (= 1/3 of the total amount of predictors). We apply the model to “data_test” and calculate the variable importance with the function importance() (“type=1” means that we calculate %IncMSE).
N <- 10
for (y in 1:N) {
set.seed(1000+y)
data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
data_train <- data[data_index,]
data_test <- data[-data_index,]
model <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
pred <- predict(model, data_test)
rsquare <- cor(pred, data_test$yield)
varimp <- importance(model, type=1)
}
In the code above, we overwrite “model”, “pred”, “rsquare” and “varimp” in each iteration, but we obviously want to save the results of all N models. Let us track the individual model predictions, R-square values and variable importances by creating some empty objects before we start the loop:
pred_track <- as.data.frame(matrix(ncol=N+1, nrow=NROW(data)))
pred_track[,1] <- data$yield
names(pred_track) <- c("observed", paste("prediction_run", c(1:N), sep="_"))
rsquare_track <- NULL
varimp_track <- as.data.frame(matrix(ncol=N+1, nrow=length(data)-1))
varimp_track[,1] <- names(data)[-2]
names(varimp_track) <- c("variable", paste("varimp_run", c(1:N), sep="_"))
for (y in 1:N) {
set.seed(1000+y)
data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
data_train <- data[data_index,]
data_test <- data[-data_index,]
model <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
## since we only predict a random sample of 30% of all observations, we have to put
## the predictions into the correct rows in "pred_track", in column y+1
pred <- predict(model, data_test)
pred_track[which(rownames(pred_track)%in%rownames(data_test)),y+1] <- pred
## we append the vector "rsquare_track" by one element
rsquare_track <- c(rsquare_track, cor(pred, data_test$yield))
## fill column y+1 in "varimp_track"
varimp_track[,y+1] <- importance(model, type=1)[,1]
}
Let’s have a look at our results - we see that the results were different in each iteration!
head(pred_track)
| observed | prediction_run_1 | prediction_run_2 | prediction_run_3 | prediction_run_4 | prediction_run_5 | prediction_run_6 | prediction_run_7 | prediction_run_8 | prediction_run_9 | prediction_run_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 21.0 | NA | 24.4328 | 22.46219 | NA | 22.61581 | 22.95399 | NA | NA | 24.75172 | 24.56811 |
| 14.2 | NA | NA | NA | NA | NA | NA | 19.52664 | NA | NA | NA |
| 25.8 | NA | NA | NA | NA | 20.93061 | NA | 22.94005 | NA | NA | NA |
| 22.4 | NA | NA | NA | 21.14468 | NA | NA | NA | NA | 21.37042 | NA |
| 22.9 | 22.16536 | NA | NA | NA | NA | NA | NA | NA | 23.31842 | NA |
| 20.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rsquare_track
## [1] 0.6514028 0.6979258 0.7276733 0.8159774 0.7732982 0.7626575 0.7631490
## [8] 0.7667782 0.7239919 0.7610512
head(varimp_track)
| variable | varimp_run_1 | varimp_run_2 | varimp_run_3 | varimp_run_4 | varimp_run_5 | varimp_run_6 | varimp_run_7 | varimp_run_8 | varimp_run_9 | varimp_run_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| region | 15.737528 | 14.906998 | 15.3575680 | 15.651557 | 17.2724899 | 19.6833368 | 15.5110523 | 17.2706946 | 14.5935207 | 16.7483408 |
| PRCP_phaseA | 3.008252 | 3.468303 | 0.7750815 | 2.076255 | 1.9137320 | 0.6688023 | 1.7306713 | 1.4738925 | 2.8369527 | 1.6667143 |
| HP_phaseA | -0.958062 | -1.346405 | 1.6947294 | -1.383351 | 0.3668462 | -1.4429276 | 0.2426819 | -0.9143026 | -0.0143967 | 0.6877902 |
| FR_phaseA | 3.929375 | 2.039395 | 3.9856482 | 6.313718 | 2.1238511 | 6.2639840 | 3.8621941 | 4.4144163 | 6.1980380 | 4.2545652 |
| GDD_phaseA | 1.932800 | -0.930060 | 0.5856092 | 2.644250 | -0.9637717 | 3.2308220 | -0.6191374 | 2.2904607 | 2.8290116 | 1.3773832 |
| PRCP_phaseB | 7.795576 | 10.348723 | 9.7403187 | 6.595456 | 6.8714872 | 6.9395852 | 10.3231374 | 8.6090900 | 8.6221115 | 5.8453857 |
Until this point, we have not calculated partial dependencies. In the last script, we created partial dependence plots with the function partialPlot(). However, this function only returned a plot but no actual data values. As we are here running N models and want to visualize the average partial dependencies of each variable across N model runs in one plot, we have to keep track of the actual underlying partial dependence values, which we can do with the function partial() from the package pdp. The function partial() returns the partial function value for a set of given values of the predictor variable defined in the argument “pred.grid”.
Let’s define 50 values of “PRCP_phaseA” as these values, that range from the minimum to the maximum observed value for “PRCP_phaseA”. We then calculate the partial dependence with the last definitions of “model” and “data_train”, and plot the resulting prediction “yhat” as a function of the 50 defined precipitation values:
minvar <- min(data[which(names(data) %in% "PRCP_phaseA")])
maxvar <- max(data[which(names(data) %in% "PRCP_phaseA")])
df <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
names(df) <- "PRCP_phaseA"
partialdep <- partial(model, pred.var = "PRCP_phaseA", pred.grid = df, train = data_train)
plot(partialdep$yhat~partialdep$PRCP_phaseA, type="l", xlab = "PRCP in phase A", ylab = "Yield in dt/ha")
Let’s calculate the partial dependencies for all numerical predictor variables and save the results in a list:
predictor_vars <- names(data)[-c(1:2)]
partialdep_track <- list()
for (p in 1:length(predictor_vars))
{ minvar <- min(data[which(names(data)%in%predictor_vars[p])])
maxvar <- max(data[which(names(data)%in%predictor_vars[p])])
df <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
names(df) <- predictor_vars[p]
partialdep_track[[p]] <- pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train)
}
head(partialdep_track[[1]])
| PRCP_phaseA | yhat |
|---|---|
| 4.781048 | 25.52643 |
| 8.087892 | 25.52643 |
| 11.394737 | 25.52643 |
| 14.701581 | 25.52643 |
| 18.008425 | 25.53422 |
| 21.315270 | 25.53607 |
head(partialdep_track[[2]])
| HP_phaseA | yhat |
|---|---|
| 0.0000000 | 25.37123 |
| 0.4810317 | 25.39267 |
| 0.9620634 | 25.35697 |
| 1.4430952 | 25.34941 |
| 1.9241269 | 25.34973 |
| 2.4051586 | 25.34882 |
Above, we have calculated partial dependencies only once. When we calculate N models, we can append each list element in “partialdep_track” with the function merge. Let’s put everything together:
pred_track <- as.data.frame(matrix(ncol=N+1, nrow=NROW(data)))
pred_track[,1] <- data$yield
names(pred_track) <- c("observed", paste("prediction_run", c(1:N), sep="_"))
rsquare_track <- NULL
varimp_track <- as.data.frame(matrix(ncol=N+1, nrow=length(data)-1))
varimp_track[,1] <- names(data)[-2]
names(varimp_track) <- c("variable", paste("varimp_run", c(1:N), sep="_"))
predictor_vars <- names(data)[-c(1:2)]
partialdep_track <- list()
for (y in 1:N) {
set.seed(1000+y)
data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
data_train <- data[data_index,]
data_test <- data[-data_index,]
model <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
## since we only predict a random sample of 30% of all observations, we have to put
## the predictions into the correct rows in "pred_track", in column y+1
pred <- predict(model, data_test)
pred_track[which(rownames(pred_track)%in%rownames(data_test)),y+1] <- pred
## we append the vector "rsquare_track" by one element
rsquare_track <- c(rsquare_track, cor(pred, data_test$yield))
## fill column y+1 in "varimp_track"
varimp_track[,y+1] <- importance(model, type=1)[,1]
for (p in 1:length(predictor_vars))
{ minvar <- min(data[which(names(data)%in%predictor_vars[p])])
maxvar <- max(data[which(names(data)%in%predictor_vars[p])])
df <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
names(df) <- predictor_vars[p]
if(y==1) {
partialdep_track[[p]] <- pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train)
} else {
partialdep_track[[p]] <- merge(partialdep_track[[p]],
pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train),
by = predictor_vars[p], all = T)
}
names(partialdep_track)[p] <- predictor_vars[p]
names(partialdep_track[[p]])[y+1] <- paste("iteration", y, sep = "_")
}
print(y)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
At this point, our model calculations are complete, and we have also tracked partial dependencies for each variable in each run:
names(partialdep_track)
## [1] "PRCP_phaseA" "HP_phaseA" "FR_phaseA" "GDD_phaseA" "PRCP_phaseB"
## [6] "TMAX_phaseB" "HP_phaseB" "FR_phaseB" "PRCP_phaseC" "DHW_phaseC"
## [11] "HP_phaseC" "FR_phaseC" "GDD_phaseC" "PRCP_phaseD" "TMAX_phaseD"
## [16] "HP_phaseD" "FR_phaseD" "GDD_phaseD" "PRCP_phaseE" "TMAX_phaseE"
## [21] "DH_phaseE" "DHW_phaseE" "NH_phaseE" "NHW_phaseE" "HP_phaseE"
## [26] "GDD_phaseE"
head(partialdep_track[[1]])
| PRCP_phaseA | iteration_1 | iteration_2 | iteration_3 | iteration_4 | iteration_5 | iteration_6 | iteration_7 | iteration_8 | iteration_9 | iteration_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 4.781048 | 26.66862 | 26.75338 | 24.81206 | 25.41225 | 25.98566 | 25.88525 | 25.94099 | 25.72438 | 26.55488 | 25.52643 |
| 8.087892 | 26.66862 | 26.75338 | 24.81206 | 25.41225 | 25.98566 | 25.88525 | 25.94099 | 25.72438 | 26.55488 | 25.52643 |
| 11.394737 | 26.66862 | 26.75338 | 24.81560 | 25.41344 | 25.98566 | 25.88840 | 25.94099 | 25.72438 | 26.55570 | 25.52643 |
| 14.701581 | 26.67115 | 26.75338 | 24.81560 | 25.41382 | 25.98566 | 25.89134 | 25.95968 | 25.74663 | 26.56113 | 25.52643 |
| 18.008425 | 26.70039 | 26.75338 | 24.83026 | 25.41970 | 25.99329 | 25.89990 | 25.96888 | 25.76416 | 26.59086 | 25.53422 |
| 21.315270 | 26.70753 | 26.75363 | 24.85779 | 25.42769 | 25.99487 | 25.90234 | 25.98309 | 25.77531 | 26.59931 | 25.53607 |
Let us summarize our results for variable importances and partial dependencies visually!
To start off with variable importances, we first have to calculate mean values from the N predictions performed above. Let us also separate the climatic parameters and phase names into separate character strings. “region” was the most important variable, but we exclude it here because it will not fit in our plot:
varimp_mean <- data.frame(variable = varimp_track$variable, mean = rowMeans(varimp_track[,c(2:N+1)]))
for (i in 1:NROW(varimp_mean))
{ varimp_mean$parameter[i] <- strsplit(varimp_mean$variable[i],split="_")[[1]][1]
varimp_mean$phase[i] <- strsplit(varimp_mean$variable[i],split="_")[[1]][2]
}
varimp_mean <- varimp_mean[-1,]
head(varimp_mean)
| variable | mean | parameter | phase | |
|---|---|---|---|---|
| 2 | PRCP_phaseA | 1.8456005 | PRCP | phaseA |
| 3 | HP_phaseA | -0.2343706 | HP | phaseA |
| 4 | FR_phaseA | 4.3839789 | FR | phaseA |
| 5 | GDD_phaseA | 1.1605075 | GDD | phaseA |
| 6 | PRCP_phaseB | 8.2105883 | PRCP | phaseB |
| 7 | TMAX_phaseB | 4.9887464 | TMAX | phaseB |
Bin the mean values into pre-defined categories:
varimp_mean$mean_factor <- cut(varimp_mean$mean,
breaks=c(-1,2.5,5,7.5,10),
labels=c("< 2.5", "2.5-5", "5-7.5", "7.5-10"))
Let us now plot the variable importance value with ggplot() and geom_tile(). Apparently, precipitation in phase B was the most important variable!
ggplot(varimp_mean,aes(x = phase , y= parameter, fill = mean_factor)) +
geom_tile(colour="white", size=0.2) +
geom_text(aes(label = round(mean, 1)), color="white", fontface="bold", size=5) +
labs(x="",y="") +
scale_fill_manual(values=rev(magma(9)[2:8]), name="%IncMSE") +
theme(axis.text.x=element_text(angle=45,hjust=1))
To plot the partial dependence, we first have to calculate mean and standard deviation across all model runs. Let us work with the data for TMAX in phase D:
partialdep_sel <- partialdep_track[[15]]
partialdep_summary <- data.frame(TMAX_phaseD = partialdep_sel[,1],
partdep_mean = apply(partialdep_sel[2:N+1], 1, mean),
partdep_sd = apply(partialdep_sel[2:N+1], 1, sd))
head(partialdep_summary)
| TMAX_phaseD | partdep_mean | partdep_sd |
|---|---|---|
| 12.76655 | 26.24412 | 0.5789103 |
| 12.99966 | 26.24412 | 0.5789103 |
| 13.23278 | 26.24180 | 0.5790403 |
| 13.46589 | 26.24002 | 0.5789699 |
| 13.69900 | 26.24839 | 0.5786900 |
| 13.93212 | 26.24812 | 0.5785040 |
We can now plot the partial dependence with ggplot():
ggplot(data = partialdep_summary, aes(x = TMAX_phaseD, y = partdep_mean)) +
geom_line(aes(), size = 1.3) +
geom_ribbon(aes(ymin = partdep_mean-partdep_sd, ymax = partdep_mean+partdep_sd), alpha=0.2) +
theme(axis.text.x=element_text(angle=45,hjust=1)) +
ggtitle("Partial dependence of yield on TMAX during Phase D") +
xlab("TMAX (°C)") +
ylab("Yield (ct/ha)")
Congratulations, you made it to the end of the second script! You can now solve the exercises 06 to 08.