Welcome to the second script of Module 5! Here, we will use Random Forest models to predict historical crop yields with climate and weather variables aggregated for different crop growth phases. You will learn how to:

  • assess collinearity among predictor variables
  • run advanced Random Forest models
  • create advanced visualizations of Random Forest results

Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!


1. Data inspection

As usual, we start by loading the packages we need, and by defining our working directory:

library(randomForest)
library(ggplot2)
library(corrplot)
library(pdp)
library(viridis)

dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_5_Random_Forests/M05_Part02_Climate/"

Import the data for winter wheat. It contains one observation for each year and marz. The variable “yield” is the response variable and is based on the official statistics reported by ARMSTAT.

There are 11 predictors for each growth phase. In this case, winter wheat has five phases:

  • “phaseA”: vegetative phase 1 (before winter)
  • “phaseB”: winter dormancy
  • “phaseC”: vegetative phase 2
  • “phaseD”: reproductive phase
  • “phaseE”: grain filling

(see Table 3 in the main report of WP3).

The following 11 climate and weather parameters were calculated from daily CHIRPS precipitation and ERA5 temperature data, aggregated for the respective phase and marz:

  • “PRCP”: precipitation
  • “TAVG”: average temperature
  • “TMIN”: minimum temperature
  • “TMAX”: maximum temperature
  • “DH”: day heat
  • “DHW”: day heat waves
  • “NH”: night heat
  • “NHW”: night heat waves
  • “HP”: heavy precipitation
  • “FR”: frost
  • “GDD”: growing degree days

Factor levels are not saved in CSV files. Therefore, we have to convert “region” after importing the data. Let us also exclude all incomplete observations, i.e. observations that are “NA” for at least one predictor variable:

data <- read.csv(paste0(dir, "data/historical_climate/winterwheat_historical.csv"))
head(data)
region year yield PRCP_phaseA TAVG_phaseA TMIN_phaseA TMAX_phaseA DH_phaseA DHW_phaseA NH_phaseA NHW_phaseA HP_phaseA FR_phaseA GDD_phaseA PRCP_phaseB TAVG_phaseB TMIN_phaseB TMAX_phaseB DH_phaseB DHW_phaseB NH_phaseB NHW_phaseB HP_phaseB FR_phaseB GDD_phaseB PRCP_phaseC TAVG_phaseC TMIN_phaseC TMAX_phaseC DH_phaseC DHW_phaseC NH_phaseC NHW_phaseC HP_phaseC FR_phaseC GDD_phaseC PRCP_phaseD TAVG_phaseD TMIN_phaseD TMAX_phaseD DH_phaseD DHW_phaseD NH_phaseD NHW_phaseD HP_phaseD FR_phaseD GDD_phaseD PRCP_phaseE TAVG_phaseE TMIN_phaseE TMAX_phaseE DH_phaseE DHW_phaseE NH_phaseE NHW_phaseE HP_phaseE FR_phaseE GDD_phaseE
Aragatsotn 2005 21.0 43.24110 8.801186 3.662232 14.40586 0 0 0 0 0.000000 -22.58002 589.6795 190.4587 -5.592452 -10.014605 -0.8960821 0 0 0 0 0.0000000 -1443.4364 0 107.9768 8.102834 3.011731 13.10697 0.000000 0.000000 0 0 0.0000000 -5.785085 324.1134 121.57942 14.10018 8.130387 19.71829 0 0 0 0 6.556601 0 719.1092 54.70011 19.22392 13.50051 24.97799 0 0 0 0 0.000000 0 884.3002
Aragatsotn 2006 14.2 103.51512 8.433505 3.642841 13.86310 0 0 0 0 12.640926 -24.17632 463.8428 181.4930 -2.939045 -6.931621 1.5802151 0 0 0 0 0.0000000 -1048.4495 0 210.0441 9.784795 4.072924 15.21274 12.331413 12.331413 0 0 6.3990479 -7.323258 474.9240 29.14831 17.96123 11.053016 24.35719 0 0 0 0 0.897646 0 431.0694 69.14790 18.89822 13.07337 24.69152 0 0 0 0 0.000000 0 907.1144
Aragatsotn 2007 25.8 170.12327 12.675672 7.566744 18.25980 0 0 0 0 7.777229 0.00000 1128.1348 208.6596 -3.551540 -7.888055 1.3182586 0 0 0 0 0.5757751 -1357.6673 0 143.9691 11.289824 5.381687 16.98843 10.096273 7.564598 0 0 7.1704102 -3.143651 365.3237 28.97024 15.69509 9.933802 21.31315 0 0 0 0 0.000000 0 470.8527 117.21213 18.01944 12.28245 23.94591 0 0 0 0 4.556953 0 1243.3416
Aragatsotn 2008 22.4 125.70627 8.952586 4.092550 14.32815 0 0 0 0 0.000000 -22.30723 511.8206 143.8115 -6.161019 -10.536147 -1.3126604 0 0 0 0 12.7685776 -1381.1310 0 190.3477 9.066595 3.042949 14.80560 4.863117 2.390143 0 0 0.0000000 -31.244697 682.0981 90.31423 15.48589 9.610766 21.01814 0 0 0 0 32.556789 0 464.5767 30.53244 19.24053 13.47968 25.28650 0 0 0 0 0.000000 0 731.1402
Aragatsotn 2009 22.9 76.48413 9.026957 4.231709 14.40760 0 0 0 0 4.741770 -28.23382 615.1244 157.4761 -3.539272 -7.738430 1.0963298 0 0 0 0 0.0000000 -1060.2846 0 187.0446 6.575619 1.004866 11.95773 2.232709 0.000000 0 0 6.9939194 -72.142856 418.2866 223.76986 15.43371 9.784316 20.83972 0 0 0 0 73.703217 0 586.4812 67.01230 16.35030 10.66225 22.11904 0 0 0 0 5.755676 0 833.8652
Aragatsotn 2010 20.0 142.17263 7.429945 2.704584 12.71059 0 0 0 0 4.878027 -16.40441 423.5069 331.1603 -1.032124 -4.936632 3.0436822 0 0 0 0 8.0066032 -748.4335 0 168.8895 8.651260 3.459903 13.61745 4.664848 0.000000 0 0 0.1802406 -6.164672 397.3530 89.16552 16.70175 10.423801 22.52617 0 0 0 0 18.590811 0 467.6491 74.09034 19.42425 13.25001 25.29352 0 0 0 0 0.000000 0 738.1215
## convert "region" to a factor variable
data$region <- as.factor(data$region)

## delete observations with NAs
data <- data[complete.cases(data),]

We are interested in the 55 climate/weather predictors, and in “region” as another predictor to account for marz-level differences (e.g. in mechanization, policies, etc.). In a regression model, you would probably define “region” as a random effect. We are not interested in “year” and assume that yearly observations are independent of one another. There are predictors that are zero for all observations; for example, there is no frost in summer during phases D and E. Let us identify and discard all such variables, plus the column “year”:

## identify all variables that are zero for all observations
delete <- NULL
for (j in 1:ncol(data)) {
  if (length(unique(data[, j])) == 1 && unique(data[, j]) == 0) {
    delete <- c(delete, j)
  }
}

delete_vars <- names(data)[delete]
delete_vars
##  [1] "DH_phaseA"  "DHW_phaseA" "NH_phaseA"  "NHW_phaseA" "DH_phaseB" 
##  [6] "DHW_phaseB" "NH_phaseB"  "NHW_phaseB" "GDD_phaseB" "NH_phaseC" 
## [11] "NHW_phaseC" "DH_phaseD"  "DHW_phaseD" "NH_phaseD"  "NHW_phaseD"
## [16] "FR_phaseE"
## delete these variables, plus the column "year"
data <- data[, !(names(data) %in% c(delete_vars, "year"))]

We have just reduced our dataset to 40 predictors. As all variables are derived from precipitation and temperature data, we can expect that some predictors are highly correlated. Random Forests can handle collinearity quite well, but we should still try to keep it low and remove the variables that contribute most to it. Let’s calculate the Pearson correlation coefficient for all combinations of numerical predictors with cor(), and plot the resulting matrix using corrplot():

dim(data)
## [1] 160  41
corcoeffs <- cor(data[c(3:41)], method = "pearson")
corrplot(corcoeffs)

We see that there is high correlation for certain predictor combinations. We can reduce this by calculating the variance inflation factor (VIF), which for each predictor is based on a linear regression of that predictor on all other predictors. To do so, we render the script “VIF.Rmd”, which contains the definition of the function vif_func(), and then apply this function to our dataset. The function iteratively removes the variable with the highest VIF until all remaining variables have a VIF below 10, a common threshold for acceptable predictor collinearity, and returns the names of the variables that can remain in the model.
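The script does not show this step, so here is a minimal sketch of how it could look; the argument names of vif_func() (in_frame, thresh, trace) are assumptions that depend on how the function is defined in “VIF.Rmd”. As a reminder, the VIF of a predictor equals 1/(1 − R²), where R² comes from regressing that predictor on all other predictors:

## render "VIF.Rmd"; by default, render() evaluates the chunks in the calling
## environment, so vif_func() becomes available afterwards
rmarkdown::render(paste0(dir, "VIF.Rmd"))

## apply the function to the 39 numerical predictors (argument names assumed)
final_vars <- vif_func(in_frame = data[, 3:41], thresh = 10, trace = TRUE)

## for illustration: the VIF of a single predictor computed by hand
r2_j <- summary(lm(PRCP_phaseA ~ ., data = data[, 3:41]))$r.squared
1/(1 - r2_j)   ## VIF of PRCP_phaseA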

We reduce our dataset to only include the variables contained in “final_vars” plus “yield” and “region”, and plot the correlation matrix one more time:

data <- data[, c(1, 2, which(names(data) %in% final_vars))]

corrplot(cor(data[c(3:28)], method = "pearson"))

Our data set is now pre-processed and ready for the model. Let’s inspect it one more time:

dim(data)
## [1] 160  28
head(data)
region yield PRCP_phaseA HP_phaseA FR_phaseA GDD_phaseA PRCP_phaseB TMAX_phaseB HP_phaseB FR_phaseB PRCP_phaseC DHW_phaseC HP_phaseC FR_phaseC GDD_phaseC PRCP_phaseD TMAX_phaseD HP_phaseD FR_phaseD GDD_phaseD PRCP_phaseE TMAX_phaseE DH_phaseE DHW_phaseE NH_phaseE NHW_phaseE HP_phaseE GDD_phaseE
Aragatsotn 21.0 43.24110 0.000000 -22.58002 589.6795 190.4587 -0.8960821 0.0000000 -1443.4364 107.9768 0.000000 0.0000000 -5.785085 324.1134 121.57942 19.71829 6.556601 0 719.1092 54.70011 24.97799 0 0 0 0 0.000000 884.3002
Aragatsotn 14.2 103.51512 12.640926 -24.17632 463.8428 181.4930 1.5802151 0.0000000 -1048.4495 210.0441 12.331413 6.3990479 -7.323258 474.9240 29.14831 24.35719 0.897646 0 431.0694 69.14790 24.69152 0 0 0 0 0.000000 907.1144
Aragatsotn 25.8 170.12327 7.777229 0.00000 1128.1348 208.6596 1.3182586 0.5757751 -1357.6673 143.9691 7.564598 7.1704102 -3.143651 365.3237 28.97024 21.31315 0.000000 0 470.8527 117.21213 23.94591 0 0 0 0 4.556953 1243.3416
Aragatsotn 22.4 125.70627 0.000000 -22.30723 511.8206 143.8115 -1.3126604 12.7685776 -1381.1310 190.3477 2.390143 0.0000000 -31.244697 682.0981 90.31423 21.01814 32.556789 0 464.5767 30.53244 25.28650 0 0 0 0 0.000000 731.1402
Aragatsotn 22.9 76.48413 4.741770 -28.23382 615.1244 157.4761 1.0963298 0.0000000 -1060.2846 187.0446 0.000000 6.9939194 -72.142856 418.2866 223.76986 20.83972 73.703217 0 586.4812 67.01230 22.11904 0 0 0 0 5.755676 833.8652
Aragatsotn 20.0 142.17263 4.878027 -16.40441 423.5069 331.1603 3.0436822 8.0066032 -748.4335 168.8895 0.000000 0.1802406 -6.164672 397.3530 89.16552 22.52617 18.590811 0 467.6491 74.09034 25.29352 0 0 0 0 0.000000 738.1215
str(data)
## 'data.frame':    160 obs. of  28 variables:
##  $ region     : Factor w/ 10 levels "Aragatsotn","Ararat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ yield      : num  21 14.2 25.8 22.4 22.9 20 26.6 25.5 29 31.5 ...
##  $ PRCP_phaseA: num  43.2 103.5 170.1 125.7 76.5 ...
##  $ HP_phaseA  : num  0 12.64 7.78 0 4.74 ...
##  $ FR_phaseA  : num  -22.6 -24.2 0 -22.3 -28.2 ...
##  $ GDD_phaseA : num  590 464 1128 512 615 ...
##  $ PRCP_phaseB: num  190 181 209 144 157 ...
##  $ TMAX_phaseB: num  -0.896 1.58 1.318 -1.313 1.096 ...
##  $ HP_phaseB  : num  0 0 0.576 12.769 0 ...
##  $ FR_phaseB  : num  -1443 -1048 -1358 -1381 -1060 ...
##  $ PRCP_phaseC: num  108 210 144 190 187 ...
##  $ DHW_phaseC : num  0 12.33 7.56 2.39 0 ...
##  $ HP_phaseC  : num  0 6.4 7.17 0 6.99 ...
##  $ FR_phaseC  : num  -5.79 -7.32 -3.14 -31.24 -72.14 ...
##  $ GDD_phaseC : num  324 475 365 682 418 ...
##  $ PRCP_phaseD: num  121.6 29.1 29 90.3 223.8 ...
##  $ TMAX_phaseD: num  19.7 24.4 21.3 21 20.8 ...
##  $ HP_phaseD  : num  6.557 0.898 0 32.557 73.703 ...
##  $ FR_phaseD  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GDD_phaseD : num  719 431 471 465 586 ...
##  $ PRCP_phaseE: num  54.7 69.1 117.2 30.5 67 ...
##  $ TMAX_phaseE: num  25 24.7 23.9 25.3 22.1 ...
##  $ DH_phaseE  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DHW_phaseE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NH_phaseE  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NHW_phaseE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HP_phaseE  : num  0 0 4.56 0 5.76 ...
##  $ GDD_phaseE : num  884 907 1243 731 834 ...

2. Run Random Forest models

Again, we first have to split our data set into training and testing data. Because we have a relatively small sample size of just 160 observations, the random sampling might have a considerable effect on model performance. We will therefore repeat the modelling “N” times, each time drawing a different random split into training and testing data. We then fit the model with 300 trees and the default value for mtry (for regression, one third of the number of predictors). We apply the model to “data_test”, calculate the correlation between predicted and observed yields (stored as “rsquare”; squaring it gives the classical R²), and calculate the variable importance with the function importance() (“type=1” means that we calculate %IncMSE).
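As a side note, for regression randomForest() uses mtry = floor(p/3) by default, where p is the number of predictors. If you want to set it explicitly, the call could look like the sketch below (the loop that follows simply relies on the default):

## explicit mtry equal to the regression default (illustrative; assumes that a
## training set "data_train" has already been drawn)
p_predictors <- ncol(data) - 1   ## all columns except the response "yield"
randomForest(yield ~ ., data = data_train, ntree = 300,
             mtry = floor(p_predictors/3), importance = TRUE)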

N <- 10

for (y in 1:N) { 

  set.seed(1000+y)
        
  data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
  data_train <- data[data_index,]
  data_test  <- data[-data_index,]
  
  model   <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
  
  pred    <- predict(model, data_test)
  
  rsquare <- cor(pred, data_test$yield)   ## Pearson's r between predicted and observed

  varimp  <- importance(model, type=1)
}

In the code above, we overwrite “model”, “pred”, “rsquare” and “varimp” in each iteration, but we obviously want to save the results of all N models. Let us track the individual model predictions, performance (correlation) values and variable importances by creating some empty objects before we start the loop:

pred_track          <- as.data.frame(matrix(ncol=N+1, nrow=NROW(data)))
pred_track[,1]      <- data$yield
names(pred_track)   <- c("observed", paste("prediction_run", c(1:N), sep="_"))
  
rsquare_track       <- NULL

varimp_track        <- as.data.frame(matrix(ncol=N+1, nrow=length(data)-1))
varimp_track[,1]    <- names(data)[-2]
names(varimp_track) <- c("variable", paste("varimp_run", c(1:N), sep="_"))


for (y in 1:N) { 

  set.seed(1000+y)
        
  data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
  data_train <- data[data_index,]
  data_test  <- data[-data_index,]
  
  model   <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
  
  ## since we only predict a random sample of 30% of all observations, we have to put
  ## the predictions into the correct rows in "pred_track", in column y+1
  pred    <- predict(model, data_test)
  pred_track[which(rownames(pred_track)%in%rownames(data_test)),y+1] <- pred
  
  ## we extend the vector "rsquare_track" by one element
  rsquare_track <- c(rsquare_track, cor(pred, data_test$yield))

  ## fill column y+1 in "varimp_track"
  varimp_track[,y+1] <- importance(model, type=1)[,1]
  
}

Let’s have a look at our results: we see that they were different in each iteration!

head(pred_track)
observed prediction_run_1 prediction_run_2 prediction_run_3 prediction_run_4 prediction_run_5 prediction_run_6 prediction_run_7 prediction_run_8 prediction_run_9 prediction_run_10
21.0 NA 24.4328 22.46219 NA 22.61581 22.95399 NA NA 24.75172 24.56811
14.2 NA NA NA NA NA NA 19.52664 NA NA NA
25.8 NA NA NA NA 20.93061 NA 22.94005 NA NA NA
22.4 NA NA NA 21.14468 NA NA NA NA 21.37042 NA
22.9 22.16536 NA NA NA NA NA NA NA 23.31842 NA
20.0 NA NA NA NA NA NA NA NA NA NA
rsquare_track
##  [1] 0.6514028 0.6979258 0.7276733 0.8159774 0.7732982 0.7626575 0.7631490
##  [8] 0.7667782 0.7239919 0.7610512
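Feel free to condense these N performance values into single numbers, for example their mean and standard deviation:

## average correlation and its variability across the N runs
mean(rsquare_track)
sd(rsquare_track)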
head(varimp_track)
variable varimp_run_1 varimp_run_2 varimp_run_3 varimp_run_4 varimp_run_5 varimp_run_6 varimp_run_7 varimp_run_8 varimp_run_9 varimp_run_10
region 15.737528 14.906998 15.3575680 15.651557 17.2724899 19.6833368 15.5110523 17.2706946 14.5935207 16.7483408
PRCP_phaseA 3.008252 3.468303 0.7750815 2.076255 1.9137320 0.6688023 1.7306713 1.4738925 2.8369527 1.6667143
HP_phaseA -0.958062 -1.346405 1.6947294 -1.383351 0.3668462 -1.4429276 0.2426819 -0.9143026 -0.0143967 0.6877902
FR_phaseA 3.929375 2.039395 3.9856482 6.313718 2.1238511 6.2639840 3.8621941 4.4144163 6.1980380 4.2545652
GDD_phaseA 1.932800 -0.930060 0.5856092 2.644250 -0.9637717 3.2308220 -0.6191374 2.2904607 2.8290116 1.3773832
PRCP_phaseB 7.795576 10.348723 9.7403187 6.595456 6.8714872 6.9395852 10.3231374 8.6090900 8.6221115 5.8453857
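As an optional extra, you could also average each observation’s predictions across all runs in which it ended up in the test set, and compare these averages to the observed yields:

## mean prediction per observation across all runs; rows that were never sampled
## into a test set remain NaN and are skipped by plot()
pred_track$mean_prediction <- rowMeans(pred_track[, 2:(N+1)], na.rm = TRUE)
plot(pred_track$mean_prediction ~ pred_track$observed,
     xlab = "Observed yield (dt/ha)", ylab = "Mean predicted yield (dt/ha)")
abline(0, 1)   ## 1:1 line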

Until this point, we have not calculated partial dependencies. In the last script, we created partial dependence plots with the function partialPlot(). However, that function is geared towards plotting a single model and does not conveniently return the underlying data values. As we are here running N models and want to visualize the average partial dependence of each variable across the N model runs in one plot, we have to keep track of the actual underlying partial dependence values, which we can do with the function partial() from the package pdp. The function partial() returns the partial dependence values for a set of predictor values supplied via the argument “pred.grid”.
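Conceptually, the partial dependence at a value v of a predictor is the model prediction with that predictor fixed at v, averaged over the observed values of all other predictors. Here is a minimal hand-made sketch of what partial() computes, using the last definitions of “model” and “data_train” (for illustration only):

## manual partial dependence for a single value v of "PRCP_phaseA"
pd_at <- function(v) {
  tmp <- data_train
  tmp$PRCP_phaseA <- v        ## fix the variable of interest at v for all rows
  mean(predict(model, tmp))   ## average the predictions over all observations
}
pd_at(100)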

Let’s define such a grid for “PRCP_phaseA”: a sequence of 51 evenly spaced values (step size (maxvar − minvar)/50) ranging from the minimum to the maximum observed value. We then calculate the partial dependence with the last definitions of “model” and “data_train”, and plot the resulting prediction “yhat” as a function of these precipitation values:

minvar     <- min(data[which(names(data) %in% "PRCP_phaseA")])
maxvar     <- max(data[which(names(data) %in% "PRCP_phaseA")])
df         <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
names(df)  <- "PRCP_phaseA"
partialdep <- partial(model, pred.var = "PRCP_phaseA", pred.grid = df, train = data_train)

plot(partialdep$yhat~partialdep$PRCP_phaseA, type="l", xlab = "PRCP in phase A", ylab = "Yield in dt/ha")

Let’s calculate the partial dependencies for all numerical predictor variables and save the results in a list:

predictor_vars  <- names(data)[-c(1:2)] 
partialdep_track <- list()

for (p in 1:length(predictor_vars))
{ minvar    <- min(data[which(names(data)%in%predictor_vars[p])])
  maxvar    <- max(data[which(names(data)%in%predictor_vars[p])])
  df        <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
  names(df) <- predictor_vars[p]
                 
  partialdep_track[[p]] <- pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train)
}

head(partialdep_track[[1]])
PRCP_phaseA yhat
4.781048 25.52643
8.087892 25.52643
11.394737 25.52643
14.701581 25.52643
18.008425 25.53422
21.315270 25.53607
head(partialdep_track[[2]])
HP_phaseA yhat
0.0000000 25.37123
0.4810317 25.39267
0.9620634 25.35697
1.4430952 25.34941
1.9241269 25.34973
2.4051586 25.34882

Above, we have calculated the partial dependencies only once. When we run N models, we can extend each list element in “partialdep_track” by merging in the results of each new run with the function merge() (see the toy example below). Let’s put everything together:
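To see what merge() does in this context, consider a small toy example with made-up numbers:

a <- data.frame(PRCP_phaseA = c(5, 10), yhat = c(25.1, 25.3))
b <- data.frame(PRCP_phaseA = c(5, 10), yhat = c(24.8, 25.0))
merge(a, b, by = "PRCP_phaseA", all = TRUE)
##   PRCP_phaseA yhat.x yhat.y
## 1           5   25.1   24.8
## 2          10   25.3   25.0

Merging on the predictor column keeps one row per grid value and places the “yhat” column of each new run next to the existing ones.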

pred_track          <- as.data.frame(matrix(ncol=N+1, nrow=NROW(data)))
pred_track[,1]      <- data$yield
names(pred_track)   <- c("observed", paste("prediction_run", c(1:N), sep="_"))
  
rsquare_track       <- NULL

varimp_track        <- as.data.frame(matrix(ncol=N+1, nrow=length(data)-1))
varimp_track[,1]    <- names(data)[-2]
names(varimp_track) <- c("variable", paste("varimp_run", c(1:N), sep="_"))     

predictor_vars  <- names(data)[-c(1:2)] 
partialdep_track <- list()
            
for (y in 1:N) { 

  set.seed(1000+y)
        
  data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
  data_train <- data[data_index,]
  data_test  <- data[-data_index,]
  
  model   <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
  
  ## since we only predict a random sample of 30% of all observations, we have to put
  ## the predictions into the correct rows in "pred_track", in column y+1
  pred    <- predict(model, data_test)
  pred_track[which(rownames(pred_track)%in%rownames(data_test)),y+1] <- pred
  
  ## we extend the vector "rsquare_track" by one element
  rsquare_track <- c(rsquare_track, cor(pred, data_test$yield))

  ## fill column y+1 in "varimp_track"
  varimp_track[,y+1] <- importance(model, type=1)[,1]

  for (p in 1:length(predictor_vars))
  { minvar    <- min(data[which(names(data)%in%predictor_vars[p])])
    maxvar    <- max(data[which(names(data)%in%predictor_vars[p])])
    df        <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
    names(df) <- predictor_vars[p]
    
    if(y==1) {
       partialdep_track[[p]]      <- pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train)
    } else {
       partialdep_track[[p]]      <- merge(partialdep_track[[p]], 
                                            pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train), 
                                            by = predictor_vars[p], all = T) 
    }
    
    names(partialdep_track)[p] <- predictor_vars[p]
    names(partialdep_track[[p]])[y+1] <- paste("iteration", y, sep = "_")
  }
  print(y)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
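If you want to reuse these results in a later session without re-running the loop, you could store all tracking objects in one file, for example with saveRDS() (the file name below is hypothetical):

## optional: store all tracked results in a single file
saveRDS(list(pred_track = pred_track, rsquare_track = rsquare_track,
             varimp_track = varimp_track, partialdep_track = partialdep_track),
        file = paste0(dir, "rf_results_winterwheat.rds"))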

At this point, our model calculations are complete, and we have also tracked partial dependencies for each variable in each run:

names(partialdep_track)
##  [1] "PRCP_phaseA" "HP_phaseA"   "FR_phaseA"   "GDD_phaseA"  "PRCP_phaseB"
##  [6] "TMAX_phaseB" "HP_phaseB"   "FR_phaseB"   "PRCP_phaseC" "DHW_phaseC" 
## [11] "HP_phaseC"   "FR_phaseC"   "GDD_phaseC"  "PRCP_phaseD" "TMAX_phaseD"
## [16] "HP_phaseD"   "FR_phaseD"   "GDD_phaseD"  "PRCP_phaseE" "TMAX_phaseE"
## [21] "DH_phaseE"   "DHW_phaseE"  "NH_phaseE"   "NHW_phaseE"  "HP_phaseE"  
## [26] "GDD_phaseE"
head(partialdep_track[[1]])
PRCP_phaseA iteration_1 iteration_2 iteration_3 iteration_4 iteration_5 iteration_6 iteration_7 iteration_8 iteration_9 iteration_10
4.781048 26.66862 26.75338 24.81206 25.41225 25.98566 25.88525 25.94099 25.72438 26.55488 25.52643
8.087892 26.66862 26.75338 24.81206 25.41225 25.98566 25.88525 25.94099 25.72438 26.55488 25.52643
11.394737 26.66862 26.75338 24.81560 25.41344 25.98566 25.88840 25.94099 25.72438 26.55570 25.52643
14.701581 26.67115 26.75338 24.81560 25.41382 25.98566 25.89134 25.95968 25.74663 26.56113 25.52643
18.008425 26.70039 26.75338 24.83026 25.41970 25.99329 25.89990 25.96888 25.76416 26.59086 25.53422
21.315270 26.70753 26.75363 24.85779 25.42769 25.99487 25.90234 25.98309 25.77531 26.59931 25.53607

3. Visualize model results

Let us summarize our results for variable importances and partial dependencies visually!

To start with the variable importances, we first have to calculate mean values across the N model runs performed above. Let us also split each variable name into separate character strings for the climatic parameter and the phase. “region” was the most important variable, but we exclude it here because it does not fit into the parameter-by-phase grid of our plot:

## note: 2:(N+1) selects all N runs; 2:N+1 would evaluate as (2:N)+1 and skip the first run
varimp_mean <- data.frame(variable = varimp_track$variable, mean = rowMeans(varimp_track[, 2:(N+1)]))
for (i in 1:NROW(varimp_mean))
{ varimp_mean$parameter[i] <- strsplit(varimp_mean$variable[i],split="_")[[1]][1]
  varimp_mean$phase[i]     <- strsplit(varimp_mean$variable[i],split="_")[[1]][2]
}

varimp_mean <- varimp_mean[-1,]

head(varimp_mean)
variable mean parameter phase
2 PRCP_phaseA 1.8456005 PRCP phaseA
3 HP_phaseA -0.2343706 HP phaseA
4 FR_phaseA 4.3839789 FR phaseA
5 GDD_phaseA 1.1605075 GDD phaseA
6 PRCP_phaseB 8.2105883 PRCP phaseB
7 TMAX_phaseB 4.9887464 TMAX phaseB

Bin the mean values into pre-defined categories:

varimp_mean$mean_factor <- cut(varimp_mean$mean,
                               breaks=c(-1,2.5,5,7.5,10),
                               labels=c("< 2.5", "2.5-5", "5-7.5", "7.5-10"))

Let us now plot the variable importance values with ggplot() and geom_tile(). Apparently, precipitation in phase B was the most important climate/weather variable!

ggplot(varimp_mean,aes(x = phase , y= parameter, fill = mean_factor)) +
               geom_tile(colour="white", size=0.2) + 
               geom_text(aes(label = round(mean, 1)), color="white", fontface="bold", size=5) +
               labs(x="",y="") +
               scale_fill_manual(values=rev(magma(9)[2:8]), name="%IncMSE") +
               theme(axis.text.x=element_text(angle=45,hjust=1))

To plot the partial dependence, we first have to calculate mean and standard deviation across all model runs. Let us work with the data for TMAX in phase D:

partialdep_sel <- partialdep_track[[15]]

partialdep_summary <- data.frame(TMAX_phaseD = partialdep_sel[,1], 
                                 partdep_mean = apply(partialdep_sel[, 2:(N+1)], 1, mean),
                                 partdep_sd   = apply(partialdep_sel[, 2:(N+1)], 1, sd))

head(partialdep_summary)
TMAX_phaseD partdep_mean partdep_sd
12.76655 26.24412 0.5789103
12.99966 26.24412 0.5789103
13.23278 26.24180 0.5790403
13.46589 26.24002 0.5789699
13.69900 26.24839 0.5786900
13.93212 26.24812 0.5785040

We can now plot the partial dependence with ggplot():

ggplot(data = partialdep_summary, aes(x = TMAX_phaseD, y = partdep_mean)) + 
                        geom_line(size = 1.3) +
                        geom_ribbon(aes(ymin = partdep_mean-partdep_sd, ymax = partdep_mean+partdep_sd), alpha=0.2) +
                        theme(axis.text.x=element_text(angle=45,hjust=1)) +
                        ggtitle("Partial dependence of yield on TMAX during Phase D") +
                        xlab("TMAX (°C)") +
                        ylab("Yield (dt/ha)")

Congratulations, you made it to the end of the second script! You can now solve the exercises 06 to 08.