Welcome to the second script of Module 5! Here, we will use Random Forest models to predict historical crop yields with climate and weather variables aggregated for different crop growth phases. You will learn how to:

assess collinearity among predictor variables
run advanced Random Forest models
create advanced visualizations of Random Forest results

Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!

1. Data inspection

As usual, we start by loading the packages we need, and by defining our working directory:

library(randomForest)
library(ggplot2)
library(corrplot)
library(pdp)
library(viridis)

dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_5_Random_Forests/M05_Part02_Climate/"

Import the data for winter wheat. It contains one observation for each year and marz. The variable “yield” is the response variable and is based on the official statistics reported by ARMSTAT.

There are 11 predictors for each growth phase. In this case, winter wheat has five phases:

“phaseA”: vegetative phase 1 (before winter)
“phaseB”: winter dormancy
“phaseC”: vegetative phase 2
“phaseD”: reproductive phase and
“phaseE”: grain filling

(see Table 3 in the main report of WP3).

And there are the following 11 climate and weather parameters that were calculated from daily CHIRPS precipitation and ERA5-temperature data that was aggregated for the respective phase and marz:

“PRCP”: precipitation
“TAVG”: average temperature
“TMIN”: minimum temperature
“TMAX”: maximum temperature
“DH”: day heat
“DHW”: day heat waves
“NH”: night heat
“NHW”: night heat waves
“HP”: heavy precipitation
“FR”: frost
“GDD”: growing degree days

Factor levels are not saved in CSV files. Therefore, we have to convert “region” after importing the data. Let us also exclude all incomplete observations, i.e. observations that are “NA” for at least one predictor variable:

data <- read.csv(paste0(dir, "data/historical_climate/winterwheat_historical.csv"))
head(data)

region	year	yield	PRCP_phaseA	TAVG_phaseA	TMIN_phaseA	TMAX_phaseA	HP_phaseA	FR_phaseA	GDD_phaseA	PRCP_phaseB	TAVG_phaseB	TMIN_phaseB	TMAX_phaseB	HP_phaseB	FR_phaseB	PRCP_phaseC	TAVG_phaseC	TMIN_phaseC	TMAX_phaseC	DH_phaseC	DHW_phaseC	HP_phaseC	FR_phaseC	GDD_phaseC	PRCP_phaseD	TAVG_phaseD	TMIN_phaseD	TMAX_phaseD	HP_phaseD	GDD_phaseD	PRCP_phaseE	TAVG_phaseE	TMIN_phaseE	TMAX_phaseE	HP_phaseE	GDD_phaseE
Aragatsotn	2005	21.0	43.24110	8.801186	3.662232	14.40586	0.000000	-22.58002	589.6795	190.4587	-5.592452	-10.014605	-0.8960821	0.0000000	-1443.4364	107.9768	8.102834	3.011731	13.10697	0.000000	0.000000	0.0000000	-5.785085	324.1134	121.57942	14.10018	8.130387	19.71829	6.556601	719.1092	54.70011	19.22392	13.50051	24.97799	0.000000	884.3002
Aragatsotn	2006	14.2	103.51512	8.433505	3.642841	13.86310	12.640926	-24.17632	463.8428	181.4930	-2.939045	-6.931621	1.5802151	0.0000000	-1048.4495	210.0441	9.784795	4.072924	15.21274	12.331413	12.331413	6.3990479	-7.323258	474.9240	29.14831	17.96123	11.053016	24.35719	0.897646	431.0694	69.14790	18.89822	13.07337	24.69152	0.000000	907.1144
Aragatsotn	2007	25.8	170.12327	12.675672	7.566744	18.25980	7.777229	0.00000	1128.1348	208.6596	-3.551540	-7.888055	1.3182586	0.5757751	-1357.6673	143.9691	11.289824	5.381687	16.98843	10.096273	7.564598	7.1704102	-3.143651	365.3237	28.97024	15.69509	9.933802	21.31315	0.000000	470.8527	117.21213	18.01944	12.28245	23.94591	4.556953	1243.3416
Aragatsotn	2008	22.4	125.70627	8.952586	4.092550	14.32815	0.000000	-22.30723	511.8206	143.8115	-6.161019	-10.536147	-1.3126604	12.7685776	-1381.1310	190.3477	9.066595	3.042949	14.80560	4.863117	2.390143	0.0000000	-31.244697	682.0981	90.31423	15.48589	9.610766	21.01814	32.556789	464.5767	30.53244	19.24053	13.47968	25.28650	0.000000	731.1402
Aragatsotn	2009	22.9	76.48413	9.026957	4.231709	14.40760	4.741770	-28.23382	615.1244	157.4761	-3.539272	-7.738430	1.0963298	0.0000000	-1060.2846	187.0446	6.575619	1.004866	11.95773	2.232709	0.000000	6.9939194	-72.142856	418.2866	223.76986	15.43371	9.784316	20.83972	73.703217	586.4812	67.01230	16.35030	10.66225	22.11904	5.755676	833.8652
Aragatsotn	2010	20.0	142.17263	7.429945	2.704584	12.71059	4.878027	-16.40441	423.5069	331.1603	-1.032124	-4.936632	3.0436822	8.0066032	-748.4335	168.8895	8.651260	3.459903	13.61745	4.664848	0.000000	0.1802406	-6.164672	397.3530	89.16552	16.70175	10.423801	22.52617	18.590811	467.6491	74.09034	19.42425	13.25001	25.29352	0.000000	738.1215

## convert "region" to a factor variable
data$region <- as.factor(data$region)

## delete observations with NAs
data <- data[complete.cases(data),]

We are interested in the 55 climate/weather predictors, and in “region” as another predictor to account for marz-level differences (e.g. in mechanization, policies, etc.). In a regression model, you would probably define “region” as a random effect. We are not interested in “year” and assume that yearly observations are independent from one another. There are predictors that are zero for all observations, for example there is no frost in summer during phases D and E. Let us identify discard all such variables, plus the column “year”:

## identify all variables that are all zero
delete <- NULL
for (j in 1:length(data))
{ if(length(unique(data[,j]))==1) {
    if(unique(data[,j])==0) { delete <- c(delete,j) } } }

delete_vars <- names(data)[delete]
delete_vars

##  [1] "DH_phaseA"  "DHW_phaseA" "NH_phaseA"  "NHW_phaseA" "DH_phaseB" 
##  [6] "DHW_phaseB" "NH_phaseB"  "NHW_phaseB" "GDD_phaseB" "NH_phaseC" 
## [11] "NHW_phaseC" "DH_phaseD"  "DHW_phaseD" "NH_phaseD"  "NHW_phaseD"
## [16] "FR_phaseE"

## delete variables
data <- data[, which(names(data)%in% c(delete_vars, "year")==F)]

We have just reduced our dataset to 40 predictors. As all variables are derived from precipitation and temperature data, we can expect that some predictors are highly correlated. Random Forests can handle collinearity quite well, but we should still try to keep it low and reduce those variables that have the highest effects. Let’s calculate the Pearson correlation coefficient for all combinations of numerical predictors with cor(), and plot it using corrplot():

dim(data)

## [1] 160  41

corcoeffs <- cor(data[c(3:41)], method = "pearson")
corrplot(corcoeffs)

We see that there is high correlation for certain predictor combinations. We can reduce this by calculating the variance inflation factor (VIF) based on a linear regression model. To do so, we render the script “VIF.Rmd” which contains the definition of the function vif_func(), which we then apply to our dataset. This function iteratively removes one variable until all variables have a VIF below 10, which is considered a common threshold for acceptable predictor collinearity, and returns the names of those variables that can remain in the model.

We reduce our dataset to only include the variables contained in “final_vars” plus “yield” and “region”, and plot the correlation matrix one more time:

data <- data[, c(1, 2, which(names(data) %in% final_vars)),]

corrplot(cor(data[c(3:28)], method = "pearson"))

Our data set is now pre-processed and ready for the model. Let’s inspect it one more time:

dim(data)

## [1] 160  28

head(data)

region	yield	PRCP_phaseA	HP_phaseA	FR_phaseA	GDD_phaseA	PRCP_phaseB	TMAX_phaseB	HP_phaseB	FR_phaseB	PRCP_phaseC	DHW_phaseC	HP_phaseC	FR_phaseC	GDD_phaseC	PRCP_phaseD	TMAX_phaseD	HP_phaseD	GDD_phaseD	PRCP_phaseE	TMAX_phaseE	HP_phaseE	GDD_phaseE
Aragatsotn	21.0	43.24110	0.000000	-22.58002	589.6795	190.4587	-0.8960821	0.0000000	-1443.4364	107.9768	0.000000	0.0000000	-5.785085	324.1134	121.57942	19.71829	6.556601	719.1092	54.70011	24.97799	0.000000	884.3002
Aragatsotn	14.2	103.51512	12.640926	-24.17632	463.8428	181.4930	1.5802151	0.0000000	-1048.4495	210.0441	12.331413	6.3990479	-7.323258	474.9240	29.14831	24.35719	0.897646	431.0694	69.14790	24.69152	0.000000	907.1144
Aragatsotn	25.8	170.12327	7.777229	0.00000	1128.1348	208.6596	1.3182586	0.5757751	-1357.6673	143.9691	7.564598	7.1704102	-3.143651	365.3237	28.97024	21.31315	0.000000	470.8527	117.21213	23.94591	4.556953	1243.3416
Aragatsotn	22.4	125.70627	0.000000	-22.30723	511.8206	143.8115	-1.3126604	12.7685776	-1381.1310	190.3477	2.390143	0.0000000	-31.244697	682.0981	90.31423	21.01814	32.556789	464.5767	30.53244	25.28650	0.000000	731.1402
Aragatsotn	22.9	76.48413	4.741770	-28.23382	615.1244	157.4761	1.0963298	0.0000000	-1060.2846	187.0446	0.000000	6.9939194	-72.142856	418.2866	223.76986	20.83972	73.703217	586.4812	67.01230	22.11904	5.755676	833.8652
Aragatsotn	20.0	142.17263	4.878027	-16.40441	423.5069	331.1603	3.0436822	8.0066032	-748.4335	168.8895	0.000000	0.1802406	-6.164672	397.3530	89.16552	22.52617	18.590811	467.6491	74.09034	25.29352	0.000000	738.1215

str(data)

## 'data.frame':    160 obs. of  28 variables:
##  $ region     : Factor w/ 10 levels "Aragatsotn","Ararat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ yield      : num  21 14.2 25.8 22.4 22.9 20 26.6 25.5 29 31.5 ...
##  $ PRCP_phaseA: num  43.2 103.5 170.1 125.7 76.5 ...
##  $ HP_phaseA  : num  0 12.64 7.78 0 4.74 ...
##  $ FR_phaseA  : num  -22.6 -24.2 0 -22.3 -28.2 ...
##  $ GDD_phaseA : num  590 464 1128 512 615 ...
##  $ PRCP_phaseB: num  190 181 209 144 157 ...
##  $ TMAX_phaseB: num  -0.896 1.58 1.318 -1.313 1.096 ...
##  $ HP_phaseB  : num  0 0 0.576 12.769 0 ...
##  $ FR_phaseB  : num  -1443 -1048 -1358 -1381 -1060 ...
##  $ PRCP_phaseC: num  108 210 144 190 187 ...
##  $ DHW_phaseC : num  0 12.33 7.56 2.39 0 ...
##  $ HP_phaseC  : num  0 6.4 7.17 0 6.99 ...
##  $ FR_phaseC  : num  -5.79 -7.32 -3.14 -31.24 -72.14 ...
##  $ GDD_phaseC : num  324 475 365 682 418 ...
##  $ PRCP_phaseD: num  121.6 29.1 29 90.3 223.8 ...
##  $ TMAX_phaseD: num  19.7 24.4 21.3 21 20.8 ...
##  $ HP_phaseD  : num  6.557 0.898 0 32.557 73.703 ...
##  $ FR_phaseD  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GDD_phaseD : num  719 431 471 465 586 ...
##  $ PRCP_phaseE: num  54.7 69.1 117.2 30.5 67 ...
##  $ TMAX_phaseE: num  25 24.7 23.9 25.3 22.1 ...
##  $ DH_phaseE  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DHW_phaseE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NH_phaseE  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NHW_phaseE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HP_phaseE  : num  0 0 4.56 0 5.76 ...
##  $ GDD_phaseE : num  884 907 1243 731 834 ...

2. Run Random Forest models

Again, we first have to split our data set into training and testing. Because we have a relatively small sample size of just 160, the random sampling might have a considerable effect on the model performance. We will therefore repeat the model “N” times, and each time select different samples that we assign as training and testing data set. We then perform the model with 300 trees and the default value for mtry (= 1/3 of the total amount of predictors). We apply the model to “data_test” and calculate the variable importance with the function importance() (“type=1” means that we calculate %IncMSE).

N <- 10

for (y in 1:N) { 

  set.seed(1000+y)
        
  data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
  data_train <- data[data_index,]
  data_test  <- data[-data_index,]
  
  model   <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
  
  pred    <- predict(model, data_test)
  
  rsquare <- cor(pred, data_test$yield)

  varimp  <- importance(model, type=1)
}

In the code above, we overwrite “model”, “pred”, “rsquare” and “varimp” in each iteration, but we obviously want to save the results of all N models. Let us track the individual model predictions, R-square values and variable importances by creating some empty objects before we start the loop:

pred_track          <- as.data.frame(matrix(ncol=N+1, nrow=NROW(data)))
pred_track[,1]      <- data$yield
names(pred_track)   <- c("observed", paste("prediction_run", c(1:N), sep="_"))
  
rsquare_track       <- NULL

varimp_track        <- as.data.frame(matrix(ncol=N+1, nrow=length(data)-1))
varimp_track[,1]    <- names(data)[-2]
names(varimp_track) <- c("variable", paste("varimp_run", c(1:N), sep="_"))


for (y in 1:N) { 

  set.seed(1000+y)
        
  data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
  data_train <- data[data_index,]
  data_test  <- data[-data_index,]
  
  model   <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
  
  ## since we only predict a random sample of 30% of all observations, we have to put
  ## the predictions into the correct rows in "pred_track", in column y+1
  pred    <- predict(model, data_test)
  pred_track[which(rownames(pred_track)%in%rownames(data_test)),y+1] <- pred
  
  ## we append the vector "rsquare_track" by one element
  rsquare_track <- c(rsquare_track, cor(pred, data_test$yield))

  ## fill column y+1 in "varimp_track"
  varimp_track[,y+1] <- importance(model, type=1)[,1]
  
}

Let’s have a look at our results - we see that the results were different in each iteration!

head(pred_track)

observed	prediction_run_1	prediction_run_2	prediction_run_3	prediction_run_4	prediction_run_5	prediction_run_6	prediction_run_7	prediction_run_8	prediction_run_9	prediction_run_10
21.0	NA	24.4328	22.46219	NA	22.61581	22.95399	NA	NA	24.75172	24.56811
14.2	NA	NA	NA	NA	NA	NA	19.52664	NA	NA	NA
25.8	NA	NA	NA	NA	20.93061	NA	22.94005	NA	NA	NA
22.4	NA	NA	NA	21.14468	NA	NA	NA	NA	21.37042	NA
22.9	22.16536	NA	NA	NA	NA	NA	NA	NA	23.31842	NA
20.0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

rsquare_track

##  [1] 0.6514028 0.6979258 0.7276733 0.8159774 0.7732982 0.7626575 0.7631490
##  [8] 0.7667782 0.7239919 0.7610512

head(varimp_track)

variable	varimp_run_1	varimp_run_2	varimp_run_3	varimp_run_4	varimp_run_5	varimp_run_6	varimp_run_7	varimp_run_8	varimp_run_9	varimp_run_10
region	15.737528	14.906998	15.3575680	15.651557	17.2724899	19.6833368	15.5110523	17.2706946	14.5935207	16.7483408
PRCP_phaseA	3.008252	3.468303	0.7750815	2.076255	1.9137320	0.6688023	1.7306713	1.4738925	2.8369527	1.6667143
HP_phaseA	-0.958062	-1.346405	1.6947294	-1.383351	0.3668462	-1.4429276	0.2426819	-0.9143026	-0.0143967	0.6877902
FR_phaseA	3.929375	2.039395	3.9856482	6.313718	2.1238511	6.2639840	3.8621941	4.4144163	6.1980380	4.2545652
GDD_phaseA	1.932800	-0.930060	0.5856092	2.644250	-0.9637717	3.2308220	-0.6191374	2.2904607	2.8290116	1.3773832
PRCP_phaseB	7.795576	10.348723	9.7403187	6.595456	6.8714872	6.9395852	10.3231374	8.6090900	8.6221115	5.8453857

Until this point, we have not calculated partial dependencies. In the last script, we created partial dependence plots with the function partialPlot(). However, this function only returned a plot but no actual data values. As we are here running N models and want to visualize the average partial dependencies of each variable across N model runs in one plot, we have to keep track of the actual underlying partial dependence values, which we can do with the function partial() from the package pdp. The function partial() returns the partial function value for a set of given values of the predictor variable defined in the argument “pred.grid”.

Let’s define 50 values of “PRCP_phaseA” as these values, that range from the minimum to the maximum observed value for “PRCP_phaseA”. We then calculate the partial dependence with the last definitions of “model” and “data_train”, and plot the resulting prediction “yhat” as a function of the 50 defined precipitation values:

minvar     <- min(data[which(names(data) %in% "PRCP_phaseA")])
maxvar     <- max(data[which(names(data) %in% "PRCP_phaseA")])
df         <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
names(df)  <- "PRCP_phaseA"
partialdep <- partial(model, pred.var = "PRCP_phaseA", pred.grid = df, train = data_train)

plot(partialdep$yhat~partialdep$PRCP_phaseA, type="l", xlab = "PRCP in phase A", ylab = "Yield in dt/ha")

Let’s calculate the partial dependencies for all numerical predictor variables and save the results in a list:

predictor_vars  <- names(data)[-c(1:2)] 
partialdep_track <- list()

for (p in 1:length(predictor_vars))
{ minvar    <- min(data[which(names(data)%in%predictor_vars[p])])
  maxvar    <- max(data[which(names(data)%in%predictor_vars[p])])
  df        <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
  names(df) <- predictor_vars[p]
                 
  partialdep_track[[p]] <- pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train)
}

head(partialdep_track[[1]])

PRCP_phaseA	yhat
4.781048	25.52643
8.087892	25.52643
11.394737	25.52643
14.701581	25.52643
18.008425	25.53422
21.315270	25.53607

head(partialdep_track[[2]])

HP_phaseA	yhat
0.0000000	25.37123
0.4810317	25.39267
0.9620634	25.35697
1.4430952	25.34941
1.9241269	25.34973
2.4051586	25.34882

Above, we have calculated partial dependencies only once. When we calculate N models, we can append each list element in “partialdep_track” with the function merge. Let’s put everything together:

pred_track          <- as.data.frame(matrix(ncol=N+1, nrow=NROW(data)))
pred_track[,1]      <- data$yield
names(pred_track)   <- c("observed", paste("prediction_run", c(1:N), sep="_"))
  
rsquare_track       <- NULL

varimp_track        <- as.data.frame(matrix(ncol=N+1, nrow=length(data)-1))
varimp_track[,1]    <- names(data)[-2]
names(varimp_track) <- c("variable", paste("varimp_run", c(1:N), sep="_"))     

predictor_vars  <- names(data)[-c(1:2)] 
partialdep_track <- list()
            
for (y in 1:N) { 

  set.seed(1000+y)
        
  data_index <- sample(nrow(data), 0.7*nrow(data), replace = FALSE)
  data_train <- data[data_index,]
  data_test  <- data[-data_index,]
  
  model   <- randomForest(yield ~ ., data = data_train, ntree = 300, importance = TRUE)
  
  ## since we only predict a random sample of 30% of all observations, we have to put
  ## the predictions into the correct rows in "pred_track", in column y+1
  pred    <- predict(model, data_test)
  pred_track[which(rownames(pred_track)%in%rownames(data_test)),y+1] <- pred
  
  ## we append the vector "rsquare_track" by one element
  rsquare_track <- c(rsquare_track, cor(pred, data_test$yield))

  ## fill column y+1 in "varimp_track"
  varimp_track[,y+1] <- importance(model, type=1)[,1]

  for (p in 1:length(predictor_vars))
  { minvar    <- min(data[which(names(data)%in%predictor_vars[p])])
    maxvar    <- max(data[which(names(data)%in%predictor_vars[p])])
    df        <- as.data.frame(seq(from=minvar,to=maxvar, by=(maxvar-minvar)/50))
    names(df) <- predictor_vars[p]
    
    if(y==1) {
       partialdep_track[[p]]      <- pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train)
    } else {
       partialdep_track[[p]]      <- merge(partialdep_track[[p]], 
                                            pdp::partial(model, pred.var = predictor_vars[p], pred.grid = df, train = data_train), 
                                            by = predictor_vars[p], all = T) 
    }
    
    names(partialdep_track)[p] <- predictor_vars[p]
    names(partialdep_track[[p]])[y+1] <- paste("iteration", y, sep = "_")
  }
  print(y)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

At this point, our model calculations are complete, and we have also tracked partial dependencies for each variable in each run:

names(partialdep_track)

##  [1] "PRCP_phaseA" "HP_phaseA"   "FR_phaseA"   "GDD_phaseA"  "PRCP_phaseB"
##  [6] "TMAX_phaseB" "HP_phaseB"   "FR_phaseB"   "PRCP_phaseC" "DHW_phaseC" 
## [11] "HP_phaseC"   "FR_phaseC"   "GDD_phaseC"  "PRCP_phaseD" "TMAX_phaseD"
## [16] "HP_phaseD"   "FR_phaseD"   "GDD_phaseD"  "PRCP_phaseE" "TMAX_phaseE"
## [21] "DH_phaseE"   "DHW_phaseE"  "NH_phaseE"   "NHW_phaseE"  "HP_phaseE"  
## [26] "GDD_phaseE"

head(partialdep_track[[1]])

PRCP_phaseA	iteration_1	iteration_2	iteration_3	iteration_4	iteration_5	iteration_6	iteration_7	iteration_8	iteration_9	iteration_10
4.781048	26.66862	26.75338	24.81206	25.41225	25.98566	25.88525	25.94099	25.72438	26.55488	25.52643
8.087892	26.66862	26.75338	24.81206	25.41225	25.98566	25.88525	25.94099	25.72438	26.55488	25.52643
11.394737	26.66862	26.75338	24.81560	25.41344	25.98566	25.88840	25.94099	25.72438	26.55570	25.52643
14.701581	26.67115	26.75338	24.81560	25.41382	25.98566	25.89134	25.95968	25.74663	26.56113	25.52643
18.008425	26.70039	26.75338	24.83026	25.41970	25.99329	25.89990	25.96888	25.76416	26.59086	25.53422
21.315270	26.70753	26.75363	24.85779	25.42769	25.99487	25.90234	25.98309	25.77531	26.59931	25.53607

3. Visualize model results

Let us summarize our results for variable importances and partial dependencies visually!

To start off with variable importances, we first have to calculate mean values from the N predictions performed above. Let us also separate the climatic parameters and phase names into separate character strings. “region” was the most important variable, but we exclude it here because it will not fit in our plot:

varimp_mean <- data.frame(variable = varimp_track$variable, mean = rowMeans(varimp_track[,c(2:N+1)]))
for (i in 1:NROW(varimp_mean))
{ varimp_mean$parameter[i] <- strsplit(varimp_mean$variable[i],split="_")[[1]][1]
  varimp_mean$phase[i]     <- strsplit(varimp_mean$variable[i],split="_")[[1]][2]
}

varimp_mean <- varimp_mean[-1,]

head(varimp_mean)

	variable	mean	parameter	phase
2	PRCP_phaseA	1.8456005	PRCP	phaseA
3	HP_phaseA	-0.2343706	HP	phaseA
4	FR_phaseA	4.3839789	FR	phaseA
5	GDD_phaseA	1.1605075	GDD	phaseA
6	PRCP_phaseB	8.2105883	PRCP	phaseB
7	TMAX_phaseB	4.9887464	TMAX	phaseB

Bin the mean values into pre-defined categories:

varimp_mean$mean_factor <- cut(varimp_mean$mean,
                               breaks=c(-1,2.5,5,7.5,10),
                               labels=c("< 2.5", "2.5-5", "5-7.5", "7.5-10"))

Let us now plot the variable importance value with ggplot() and geom_tile(). Apparently, precipitation in phase B was the most important variable!

ggplot(varimp_mean,aes(x = phase , y= parameter, fill = mean_factor)) +
               geom_tile(colour="white", size=0.2) + 
               geom_text(aes(label = round(mean, 1)), color="white", fontface="bold", size=5) +
               labs(x="",y="") +
               scale_fill_manual(values=rev(magma(9)[2:8]), name="%IncMSE") +
               theme(axis.text.x=element_text(angle=45,hjust=1))

To plot the partial dependence, we first have to calculate mean and standard deviation across all model runs. Let us work with the data for TMAX in phase D:

partialdep_sel <- partialdep_track[[15]]

partialdep_summary <- data.frame(TMAX_phaseD = partialdep_sel[,1], 
                                 partdep_mean = apply(partialdep_sel[2:N+1], 1, mean),
                                 partdep_sd   = apply(partialdep_sel[2:N+1], 1, sd))

head(partialdep_summary)

TMAX_phaseD	partdep_mean	partdep_sd
12.76655	26.24412	0.5789103
12.99966	26.24412	0.5789103
13.23278	26.24180	0.5790403
13.46589	26.24002	0.5789699
13.69900	26.24839	0.5786900
13.93212	26.24812	0.5785040

We can now plot the partial dependence with ggplot():

ggplot(data = partialdep_summary, aes(x = TMAX_phaseD, y = partdep_mean)) + 
                        geom_line(aes(), size = 1.3) +
                        geom_ribbon(aes(ymin = partdep_mean-partdep_sd, ymax = partdep_mean+partdep_sd), alpha=0.2) +
                        theme(axis.text.x=element_text(angle=45,hjust=1)) +
                        ggtitle("Partial dependence of yield on TMAX during Phase D") +
                        xlab("TMAX (°C)") +
                        ylab("Yield (ct/ha)")

Congratulations, you made it to the end of the second script! You can now solve the exercises 06 to 08.

Module 5: Random Forest Models
Script 02: Climate

Max Hofmann

1. Data inspection

2. Run Random Forest models

3. Visualize model results

Module 5: Random Forest Models Script 02: Climate

Max Hofmann

1. Data inspection

2. Run Random Forest models

3. Visualize model results

Module 5: Random Forest Models
Script 02: Climate