Welcome to the final exam! You have two hours to complete the following tasks. Please write all your solutions and answers into this Rmd document. At the end of the exam, please save this Rmd script and knit it to an HTML document. Then, please send both the Rmd script and the HTML document to hofmann@iamo.de and vahe@datamotus.com

For this script to work, please first set “dir” to the location of the exam folder on your computer. Please also enter your name here:

dir     <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Exam_Draft/"
my_name <- "Max Hofmann"

Part A: Multiple Choice Questions (24 points)

Please answer the following questions. Mark the correct answer(s) by adding “(x)” at the end of the correct answer(s), as in this example:

Question: Which day comes after Friday?

Monday
Saturday (x)
Sunday

There can be between one and four correct answers in each question. Each question is worth 4 points.

Question A.1: Which statements about climate change in Armenia are correct?

The trend in mean annual temperatures shows an increase over the last decades. (x)
The climate projections for RCP 4.5 forecast a stronger temperature increase than those for RCP 8.5.
The likelihood for longer and more intense heat waves will increase in the future. (x)
The likelihood for heavy precipitation events will decrease in the future.

Question A.2: Which data sets should ideally be taken into account in a crop yield prediction model?

Yield (x)
Soil properties (x)
Irrigation patterns (x)
Climate and weather data (x)

Question A.3: Which statements about NetCDF files are correct?

NetCDF files can store shapefiles.
NetCDF files can store metadata. (x)
NetCDF files can store multidimensional raster data. (x)
NetCDF files can store random forests.

Question A.4: What is the purpose of the Mann-Kendall test?

To assess winter chilling expressed in Utah chill units.
To assess temperature increase expressed in degrees Celsius.
To assess variable importance in a Random Forest model.
To assess the significance of a time trend. (x)
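
As a rough illustration of what the Mann-Kendall test computes, here is a minimal base R sketch. It assumes a series without tied values (no tie correction is applied), and the temperature values are made up for illustration; the function name mann_kendall is hypothetical and not part of any package.

```r
# Minimal Mann-Kendall trend test (no tie correction); illustrative only.
mann_kendall <- function(x) {
  n <- length(x)
  # diffs[i, j] = x[j] - x[i]; the upper triangle holds all pairs with i < j
  diffs <- outer(x, x, function(a, b) b - a)
  s <- sum(sign(diffs[upper.tri(diffs)]))
  # Variance of S under the null hypothesis of no trend (valid without ties)
  var_s <- n * (n - 1) * (2 * n + 5) / 18
  # Continuity-corrected standard normal test statistic
  z <- if (s > 0) (s - 1) / sqrt(var_s) else if (s < 0) (s + 1) / sqrt(var_s) else 0
  p <- 2 * pnorm(-abs(z))  # two-sided p-value
  list(S = s, z = z, p_value = p)
}

# Hypothetical annual mean temperatures in degrees Celsius (made-up numbers)
temps  <- c(11.2, 11.0, 11.5, 11.4, 11.9, 11.7, 12.1, 12.0, 12.4, 12.6)
result <- mann_kendall(temps)
print(result)
```

A positive S with a small p-value indicates a significant upward trend over time; the test says nothing about the size of the trend in physical units.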

Question A.5: Which statements about Random Forest models are correct?

Random Forest models can be used for both classification and regression. (x)
Random Forest models can return a variable importance value for each predictor variable. (x)
Random Forest models require less computational power than simple linear regression models.
Random Forest models can model non-linear relationships between predictors and the response variable. (x)

Question A.6: Which statements about winter chilling are correct?

Chill unit models are a type of machine learning.
Sufficient winter chilling is an important requirement for spring wheat.
Utah chill units quantify the amount of hourly temperatures above and below a certain temperature threshold. (x)
Winter chilling can decrease with ongoing climate change. (x)

Part B: Code Writing (24 points)

In this part, we will work with marz-level agricultural statistics of different crops produced in Armenia (source: ARMSTAT). Please only work with one part of the data, according to the following list:

Amalya Misakyan: winter wheat, yield, 2005-2009
Armine Artenyan: winter wheat, sown area, 2006-2010
Armine Sahakyan: spring barley, production, 2007-2011
Garabet Kazanjian: potato, harvested area, 2008-2012
Gohar Shahinyan: cucumber, yield, 2009-2013
Gohar Yeghiazaryan: tomato, sown area, 2010-2014
Hasmik Panyan: tomato, production, 2011-2015
Hripsime Mkrtchyan: pomaceous fruits, harvested area, 2012-2016
Kristina Baghdasaryan: stone fruits, yield, 2013-2017
Narek Tarasyan: berries, sown area, 2014-2018
Seryozha Karapetyan: winter wheat, production, 2015-2019
Yelena Khalatyan: winter wheat, harvested area, 2016-2020

Please use the empty code chunk to write your own code to complete the following tasks. Each task is worth 6 points:

Task B.1:

Import the shapefile of Armenia (“marzes.shp”) and the agricultural statistics data set (“agric_stats.csv”). Use the function filter() to create a subset of the agricultural statistics data set according to the list above.

Task B.2:

Using ggplot(), create a grouped bar graph in which you visualize the data subset. Each group should represent one year, and each bar within a group should represent one marz. Give each marz a distinct color, and give a title to your plot.

Task B.3:

Calculate the long-term mean (the mean over all years) of the data subset for each marz using group_by() and summarize().

Task B.4:

Use geo_join() to join the data subset with the shapefile. Using spplot(), create a map in which you color each marz according to the data. Give a title to your map.

## Solution:

# Task B.1:
library(ggplot2)
library(rgdal)
library(dplyr)
library(tigris)

data    <- read.csv(paste0(dir, "agric_stats.csv"), sep=";")
data_subset <- filter(data, crop == "Winter Wheat", indicator == "Yield", year %in% c(2005:2009))
armenia <- readOGR(paste0(dir, "shapefile/marzes.shp"), verbose = FALSE)

# Task B.2: 
ggplot(data = data_subset, aes(x=year, y=value, fill=region)) +
            geom_bar(position="dodge", stat="identity") +
            ggtitle("Yield of winter wheat by marz, 2005-2009") +
            xlab("") + ylab("Yield in dt/ha")

# Task B.3: 
# Mean yield per marz over all years of the subset, ignoring missing values
data_subset_mean <- data_subset %>%
  group_by(region) %>%
  summarize(mean_yield = mean(value, na.rm = TRUE)) %>%
  as.data.frame()

# Task B.4: 
# Join the means to the shapefile by marz name, then map them
armenia <- geo_join(armenia, data_subset_mean, "region", "region", how="left")
spplot(armenia, "mean_yield", main = "Yield of winter wheat, mean 2005-2009")

Part C: Code Interpretation (24 points)

For this part, we will work with the “mtcars” data set. You can find a brief description of this data set here: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars.

The following code includes a random forest model that predicts the variable “mpg”, which stands for fuel consumption (miles per gallon), using 10 different predictor variables.

Please execute the following code chunk and then answer the questions below. Please do not change the code. Each question is worth 6 points.

library(randomForest)
cars <- mtcars
help(mtcars)

seed <- as.numeric(paste(as.numeric(as.factor(strsplit(my_name, split="")[[1]]))[c(1:5)], collapse=""))
set.seed(seed)

subsample <- sample(nrow(cars), 0.75 * nrow(cars))
train <- cars[subsample, ]
test  <- cars[-subsample, ]

model <- randomForest(mpg~., data = train, ntree = 300, mtry = 5, importance = TRUE)

model

## 
## Call:
##  randomForest(formula = mpg ~ ., data = train, ntree = 300, mtry = 5,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 6.454454
##                     % Var explained: 81.99

varImpPlot(model, sort = T, n.var = 10, main = "Variable Importance")

prediction <- predict(model, newdata = test)
cor(prediction, test$mpg)^2

## [1] 0.7654653

partialPlot(model, pred.data = train, x.var = rownames(importance(model))[2])

partialPlot(model, pred.data = train, x.var = rownames(importance(model))[9])
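
The one-line seed definition in the code chunk above is dense, so the following sketch unpacks it step by step. The intermediate names chars, codes and first5 are introduced here for illustration only. The factor levels are sorted according to the collation rules of the local R session, so the resulting seed (and therefore the model output) may differ between computers.

```r
my_name <- "Max Hofmann"  # name entered at the top of the exam

# Split the name into single characters
chars <- strsplit(my_name, split = "")[[1]]

# Turn each character into the position of its factor level,
# i.e. its rank among the sorted unique characters of the name
codes <- as.numeric(as.factor(chars))

# Keep the codes of the first five characters
first5 <- codes[1:5]

# Glue the codes together into one number and use it as the seed
seed <- as.numeric(paste(first5, collapse = ""))
set.seed(seed)
print(seed)
```

Because the seed depends on the name, each student gets a different train/test split and thus different numbers in the model output.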

Question C.1: What percentage of the data was used for testing the model?

# Please write your answer here.
## solution: 25 percent
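
The 25 percent follows from the sampling step in the code chunk, where 0.75 * nrow(cars) rows are drawn for training. A short worked check of that arithmetic, using hypothetical helper names n_total, n_train, n_test and share_test:

```r
n_total <- nrow(mtcars)        # mtcars has 32 rows
n_train <- 0.75 * n_total      # 24 rows are sampled for training
n_test  <- n_total - n_train   # the remaining 8 rows form the test set

share_test <- 100 * n_test / n_total
print(share_test)
```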

Question C.2: What percentage of the variation in fuel consumption is explained by the model in the training and testing data sets, respectively?

# Please write your answer here.
## solution for "my_name" = "Max Hofmann": training: 82%; testing: 76.5%
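
Note that the two numbers are computed differently: the training value is the "% Var explained" printed by the model, which is based on out-of-bag predictions, while the testing value is the squared correlation between predicted and observed values. The sketch below, with made-up numbers, illustrates one property of the squared correlation worth keeping in mind: it ignores a constant bias in the predictions, whereas a measure based on squared errors does not.

```r
# Made-up observed values and predictions that are all 5 units too high
observed  <- c(20, 22, 25, 30)
predicted <- observed + 5

# Squared correlation: unaffected by the constant offset
r2_cor <- cor(predicted, observed)^2

# 1 - SSE/SST: penalizes the offset and can even become negative
r2_sse <- 1 - sum((observed - predicted)^2) /
              sum((observed - mean(observed))^2)

print(c(r2_cor = r2_cor, r2_sse = r2_sse))
```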

Question C.3: What are the three least important predictor variables in the model, according to the percent increase in mean squared error?

# Please write your answer here.
## solution for "my_name" = "Max Hofmann": gear (number of forward gears), qsec (1/4 mile time) and carb (number of carburetors)
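
Instead of reading the ranking off the variable importance plot, the values can also be extracted numerically. The sketch below is an illustration under its own assumptions: it fits a model on the full mtcars data with an arbitrary seed, so its ranking will generally differ from the exam solution above.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for illustration, not the name-based exam seed
model_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 300, mtry = 5,
                           importance = TRUE)

# type = 1 selects the permutation-based measure (%IncMSE)
inc_mse <- importance(model_demo, type = 1)[, 1]

# Sort ascending so that the least important predictors come first
least_important <- head(sort(inc_mse), 3)
print(least_important)
```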

Question C.4: Please describe the relationship between the second most important predictor variable and fuel consumption.

# Please write your answer here.
## solution for "my_name" = "Max Hofmann": Fuel consumption decreases with increasing displacement.

Final Exam, December 14th 2022