Welcome to the exercises of module 5! Please try to solve the following exercises, and if you do not remember how to write the code for certain things, have another look at the previous scripts. Programming is about learning by doing! So do not hesitate to look for help on the internet, too. Remember that there is almost always more than just one way to solve a problem in R!

Part A: The wine data set

Exercise 01 (10 points)

Import the wine data set and split it into a training (70%) and a testing (30%) data set. Using the training data, train a random forest model that predicts the parameter assigned to your group, with ntree = 500 and mtry = 4.
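
If you get stuck, the following minimal sketch shows one possible approach. The file name "wine.csv" and the response column "quality" are assumptions; adjust both to your group's assignment. importance = TRUE is added only so that %IncMSE is available later in Exercise 03.

    library(randomForest)

    wine <- read.csv("wine.csv")                      # hypothetical file name

    # 70/30 split: sample 70 % of the row indices for training
    set.seed(42)
    train_idx  <- sample(nrow(wine), size = round(0.7 * nrow(wine)))
    wine_train <- wine[train_idx, ]
    wine_test  <- wine[-train_idx, ]

    # Random forest with the required settings (response "quality" is an assumption)
    rf_model <- randomForest(quality ~ ., data = wine_train,
                             ntree = 500, mtry = 4, importance = TRUE)
    rf_model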

Exercise 02 (10 points)

Apply your trained model to the testing data set and report the R-squared between observed and predicted values. Re-train your model four times with different values for ntree and mtry, and report how the R-squared between observed and predicted values changes.
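
One way to compute this, assuming the objects rf_model and wine_test and the response column "quality" from the sketch in Exercise 01:

    # Predict on the testing data
    pred <- predict(rf_model, newdata = wine_test)
    obs  <- wine_test$quality                         # assumed response column

    # R-squared between observed and predicted values
    1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

    # Alternatively, the squared Pearson correlation is often reported
    cor(obs, pred)^2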

Exercise 03 (10 points)

What are the four most important variables in your model, according to the importance measure %IncMSE? Create partial dependence plots for these four variables and export them as PNGs.
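
A possible sketch, assuming the model was trained with importance = TRUE (as in the sketch of Exercise 01), since %IncMSE is only computed in that case:

    # %IncMSE per variable (type = 1) and the four highest-ranked variables
    imp  <- importance(rf_model, type = 1)
    top4 <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:4]

    # One partial dependence plot per variable, exported as PNG;
    # do.call() is used so that the variable name stored in v is passed correctly
    for (v in top4) {
      png(paste0("partial_", v, ".png"), width = 800, height = 600)
      do.call(partialPlot, list(rf_model, wine_train, v,
                                xlab = v, main = paste("Partial dependence on", v)))
      dev.off()
    }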

Part B: Your own data set

Exercise 04 (20 points)

Use your own data set (see specifications in the Slack channel) for a random forest classification or regression. Briefly explain in text form what your data set is about, what the variables mean, and what your response variable is. Split the data set into training and testing, train your model on the training data, and apply it to the testing data set. If your response variable is categorical, export the confusion matrix between observed and predicted values as a CSV file; if it is numeric, plot the observed against the predicted values in a scatter plot and export it as a PNG.
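
Since everyone's data set is different, the following can only be a rough sketch; the data frame mydata and the response column response are placeholders for your own names.

    library(randomForest)
    library(ggplot2)

    set.seed(1)
    idx   <- sample(nrow(mydata), size = round(0.7 * nrow(mydata)))
    train <- mydata[idx, ]
    test  <- mydata[-idx, ]

    rf   <- randomForest(response ~ ., data = train, ntree = 500, importance = TRUE)
    pred <- predict(rf, newdata = test)

    if (is.factor(mydata$response)) {
      # categorical response: confusion matrix, exported as CSV
      cm <- table(observed = test$response, predicted = pred)
      write.csv(as.data.frame.matrix(cm), "confusion_matrix.csv")
    } else {
      # numeric response: observed vs. predicted scatter plot, exported as PNG
      p <- ggplot(data.frame(observed = test$response, predicted = pred),
                  aes(x = observed, y = predicted)) +
        geom_point() +
        geom_abline(slope = 1, intercept = 0, linetype = "dashed")
      ggsave("obs_vs_pred.png", p, width = 6, height = 5)
    }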

Exercise 05 (10 points)

What are the most important variables in your model? Does the variable ranking meet your expectations? Please plot the partial dependencies that you find most interesting, and export them as PNGs. Please explain the reasons that could lie behind the patterns in these plots.
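
To inspect the ranking, assuming the model rf from the sketch in Exercise 04:

    # Quick graphical overview of the variable ranking
    varImpPlot(rf)

    # The same information as a sorted table
    imp <- importance(rf)
    imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]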

Part C: Climate data set

Exercise 06 (10 points)

Predicting historical yield

  • Load the historical data of the crop assigned to your group from the folder “historical_climate”. Delete all incomplete observations (i.e. rows that contain any NA values).
  • Remove all predictor variables that are zero for all observations, as well as all predictor variables with a Variance Inflation Factor (VIF) above 10. Report the names of all variables that you excluded.
  • Train and test a random forest model 20 times, each time drawing different samples for training (70%) and testing (30%). Save the variable importances and partial dependencies of each model run, and export them as RDS files using saveRDS(). (A sketch of one possible workflow follows below this list.)
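
The sketch below shows one possible workflow. The file name, the response column name yield, and the use of the usdm package (function vifstep()) for the VIF filtering are assumptions; adapt them to your crop and your data.

    library(randomForest)
    library(usdm)                       # assumed here for vifstep(); install if needed

    dat <- read.csv("historical_climate/crop_historical.csv")   # hypothetical file name
    dat <- dat[complete.cases(dat), ]                            # drop rows with any NA

    # Predictors that are zero for all observations
    pred_names <- setdiff(names(dat), "yield")                   # "yield" is an assumption
    all_zero   <- pred_names[sapply(dat[pred_names], function(x) all(x == 0))]
    dat        <- dat[, !(names(dat) %in% all_zero)]
    all_zero                                                     # report these names

    # Predictors with a VIF above 10 (vifstep() drops them stepwise)
    vif_res  <- vifstep(dat[, setdiff(names(dat), "yield")], th = 10)
    excluded <- as.character(vif_res@excluded)
    dat      <- dat[, !(names(dat) %in% excluded)]
    excluded                                                     # report these names

    # 20 model runs with different training/testing splits
    imp_list <- list()
    pd_list  <- list()
    for (i in 1:20) {
      set.seed(i)
      idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
      rf  <- randomForest(yield ~ ., data = dat[idx, ], ntree = 500, importance = TRUE)

      # optional: check performance on the testing part of this split
      pred_test <- predict(rf, newdata = dat[-idx, ])
      print(cor(dat$yield[-idx], pred_test)^2)

      imp_list[[i]] <- importance(rf)

      # partial dependence of every predictor, stored without plotting
      vars         <- setdiff(names(dat), "yield")
      pd_list[[i]] <- lapply(vars, function(v)
        do.call(partialPlot, list(rf, dat[idx, ], v, plot = FALSE)))
      names(pd_list[[i]]) <- vars
    }

    saveRDS(imp_list, "variable_importances.RDS")
    saveRDS(pd_list,  "partial_dependencies.RDS")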

Exercise 07 (10 points)

Plot model results

  • Plot the mean variable importances across all 20 model runs with ggplot() and geom_tile(). Save your plot as PNG.
  • For the 4 most important variables, calculate the mean and standard deviation of their partial dependencies across all 20 model runs. Plot the partial dependencies of these 4 variables, arrange the 4 resulting plots in one window using grid.arrange(), and save your plot as PNG. (A sketch follows below this list.)
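
A sketch building on the RDS files from Exercise 06; the column name "%IncMSE" and identical x-grids of the partial dependencies across runs are assumptions.

    library(ggplot2)
    library(gridExtra)

    imp_list <- readRDS("variable_importances.RDS")
    pd_list  <- readRDS("partial_dependencies.RDS")

    # Mean %IncMSE per variable across the 20 runs, shown as a tile plot
    imp_mean <- rowMeans(sapply(imp_list, function(m) m[, "%IncMSE"]))
    imp_df   <- data.frame(variable = names(imp_mean), importance = imp_mean)

    p_imp <- ggplot(imp_df, aes(x = 1, y = reorder(variable, importance),
                                fill = importance)) +
      geom_tile() +
      labs(x = NULL, y = NULL, fill = "%IncMSE")
    ggsave("variable_importance.png", p_imp, width = 4, height = 6)

    # Mean and standard deviation of the partial dependencies of the top 4 variables
    top4  <- names(sort(imp_mean, decreasing = TRUE))[1:4]
    plots <- lapply(top4, function(v) {
      y_mat <- sapply(pd_list, function(run) run[[v]]$y)   # assumes identical x grids
      df <- data.frame(x    = pd_list[[1]][[v]]$x,
                       mean = rowMeans(y_mat),
                       sd   = apply(y_mat, 1, sd))
      ggplot(df, aes(x = x, y = mean)) +
        geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd), alpha = 0.3) +
        geom_line() +
        labs(x = v, y = "Partial dependence")
    })

    png("partial_dependence_top4.png", width = 1000, height = 800)
    grid.arrange(grobs = plots, ncol = 2)
    dev.off()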

Exercise 08 (20 points)

Predicting future yield

  • Train a model with the historical data to predict future yields, using the future climatic data from the folder “future_climate” as the testing data set. Do this 10 times for each combination of RCP and future time period separately. Save the resulting 10 predictions of each combination in 4 separate data frames, and export these data frames as CSV files (see the sketch after this list).
  • For each marz, calculate the long-term average historical yield based on the yearly yield values in the folder “historical_yields”.
  • From the future yield predictions and the long-term historical yields, calculate for each marz and each combination of RCP and future time period the predicted increase or decrease in yield in percent. Plot the results in a multi-panel map using ggplot(), and export it as PNG. This figure should resemble the one from the main report of WP4.
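
The first bullet could look roughly like the sketch below. The RCP names, time periods, file naming scheme, and the presence of a marz column in the future data are all assumptions, and dat is the filtered historical data set from the sketch in Exercise 06. The percent change per marz and the multi-panel map can then be built with ggplot(), e.g. using facets such as facet_grid(rcp ~ period).

    library(randomForest)

    # hypothetical RCPs and time periods -- adjust to the files you actually have
    combos <- expand.grid(rcp    = c("rcp45", "rcp85"),
                          period = c("2041-2070", "2071-2100"),
                          stringsAsFactors = FALSE)

    for (k in seq_len(nrow(combos))) {
      # hypothetical file naming scheme inside "future_climate"
      future <- read.csv(file.path("future_climate",
                                   paste0(combos$rcp[k], "_", combos$period[k], ".csv")))

      preds <- data.frame(marz = future$marz)          # assumes a "marz" column
      for (i in 1:10) {
        # 10 repetitions, here simply with different random seeds;
        # you could also resample the historical training data each time
        set.seed(i)
        rf <- randomForest(yield ~ ., data = dat, ntree = 500)   # "dat" from Exercise 06
        preds[[paste0("run_", i)]] <- predict(rf, newdata = future)
      }

      write.csv(preds,
                paste0("future_yield_", combos$rcp[k], "_", combos$period[k], ".csv"),
                row.names = FALSE)
    }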