Welcome to the final exam! You have two hours to complete the following tasks. Please write all your solutions and answers in this Rmd document. At the end of the exam, save the Rmd script and knit it to an HTML document. Then send both the Rmd script and the HTML document to hofmann@iamo.de and vahe@datamotus.com.
For this script to work, please first edit the directory assigned to “dir” so that it points to the exam folder on your computer. Please also enter your name here:
dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Exam_Draft/"
my_name <- "Max Hofmann"
Please answer the following questions. Mark the correct answer(s) by adding “(x)” at the end of each correct answer, as in this example:
Question: Which day comes after Friday?
There can be between one and four correct answers in each question. Each question is worth 4 points.
In this part, we will work with marz-level agricultural statistics of different crops produced in Armenia (source: ARMSTAT). Please only work with one part of the data, according to the following list:
Please use the empty code chunk to write your own code for the following tasks. Each task is worth 6 points:
Import the shapefile of Armenia (“marzes.shp”) and the agricultural statistics data set (“agric_stats.csv”). Use the function filter() to create a subset of the agricultural statistics data set according to the list above.
Using ggplot(), create a grouped bar graph in which you visualize the data subset. Each group should represent one year, and each bar within a group should represent one marz. Give each marz a distinct color, and give a title to your plot.
Calculate the multi-year mean of the data subset for each marz using group_by() and summarize().
Use geo_join() to join the data subset with the shapefile. Using spplot(), create a map in which you color each marz according to the data. Give a title to your map.
## Solution:
# Task B.1:
library(ggplot2)
library(rgdal)
library(dplyr)
library(tigris)
data <- read.csv(paste0(dir, "agric_stats.csv"), sep=";")
data_subset <- filter(data, crop == "Winter Wheat", indicator == "Yield", year %in% c(2005:2009))
armenia <- readOGR(paste0(dir, "shapefile/marzes.shp"), verbose = FALSE)
# Task B.2:
ggplot(data = data_subset, aes(x=year, y=value, fill=region)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Yield of winter wheat by marz, 2005-2009") +
xlab("") + ylab("Yield in dt/ha")
# Task B.3:
# The subset holds yield values, so the summary column is named mean_yield
data_subset_mean <- data_subset %>% group_by(region) %>% summarize(mean_yield = mean(value, na.rm = TRUE)) %>% as.data.frame()
# Task B.4:
armenia <- geo_join(armenia, data_subset_mean, "region", "region", how="left")
# Task B.5:
spplot(armenia, "mean_yield", main = "Yield of winter wheat, mean 2005-2009")
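The per-marz mean from Task B.3 can be cross-checked in base R. The following is only a sketch on a small made-up data frame: the column names mirror those used in the solution, but the region names and values are hypothetical and are not ARMSTAT figures.

```r
# Hypothetical toy data: two marzes, three years, one missing value
toy <- data.frame(
  region = rep(c("Ararat", "Shirak"), each = 3),
  year   = rep(2005:2007, times = 2),
  value  = c(30, 34, NA, 20, 22, 24)
)

# Mean per region over all years; tapply() passes na.rm = TRUE on to mean()
toy_mean <- tapply(toy$value, toy$region, mean, na.rm = TRUE)
print(toy_mean)
```

The missing value is dropped before averaging, so the first region is averaged over two years and the second over three.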
For this part, we will work with the “mtcars” data set. You can find a brief description of this data set here: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars.
The following code fits a random forest model that predicts the variable “mpg” (miles per gallon, a measure of fuel efficiency) from 10 predictor variables.
Please execute the following code chunk and then answer the questions below. Please do not change the code. Each question is worth 6 points.
library(randomForest)
cars <- mtcars
help(mtcars)
seed <- as.numeric(paste(as.numeric(as.factor(strsplit(my_name, split="")[[1]]))[c(1:5)], collapse=""))
set.seed(seed)
subsample <- sample(nrow(cars), 0.75 * nrow(cars))
train <- cars[subsample, ]
test <- cars[-subsample, ]
model <- randomForest(mpg~., data = train, ntree = 300, mtry = 5, importance = TRUE)
model
##
## Call:
## randomForest(formula = mpg ~ ., data = train, ntree = 300, mtry = 5, importance = TRUE)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 6.454454
## % Var explained: 81.99
varImpPlot(model, sort = T, n.var = 10, main = "Variable Importance")
prediction <- predict(model, newdata = test)
cor(prediction, test$mpg)^2
## [1] 0.7654653
partialPlot(model, pred.data = train, x.var = rownames(importance(model))[2])
partialPlot(model, pred.data = train, x.var = rownames(importance(model))[9])
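The seed in the code above is derived from the first five characters of my_name. The sketch below unpacks that one-liner step by step, using the example name from this document. No expected value is shown, because as.factor() orders its levels by the locale's sort order, so the resulting seed (and with it the name-specific solutions) may differ between machines.

```r
my_name <- "Max Hofmann"                      # example name from this document

chars <- strsplit(my_name, split = "")[[1]]   # one element per character
codes <- as.numeric(as.factor(chars))         # index of each character among the sorted unique characters
seed  <- as.numeric(paste(codes[1:5], collapse = ""))  # first five codes pasted into one number

print(seed)
```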
# Please write your answer here.
## solution: 25 percent
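The 25 percent figure follows from the split itself and does not depend on the seed. A minimal check, assuming an arbitrary seed of 1 for illustration (the exam derives its seed from my_name):

```r
set.seed(1)                                   # arbitrary seed, for illustration only
cars <- mtcars                                # 32 rows
subsample <- sample(nrow(cars), 0.75 * nrow(cars))
train <- cars[subsample, ]
test  <- cars[-subsample, ]

nrow(train)                                   # 24
nrow(test)                                    # 8
test_share <- nrow(test) / nrow(cars)         # 0.25
```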
# Please write your answer here.
## solution for "my_name" = "Max Hofmann": training: 82%; testing: 76.5%
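The two percentages are not the same kind of measure: “% Var explained” in the randomForest printout is based on out-of-bag predictions for the training data, while the test figure is a squared correlation. The sketch below, using made-up numbers, illustrates one way the two kinds of measure can diverge: a squared correlation ignores a constant offset in the predictions, whereas an explained-variance measure penalizes it.

```r
# Hypothetical observed values and predictions that are all 5 units too high
obs  <- c(10, 20, 30, 40)
pred <- obs + 5

r2_cor <- cor(pred, obs)^2                    # squared correlation: approximately 1

sse <- sum((obs - pred)^2)                    # 4 * 5^2 = 100
sst <- sum((obs - mean(obs))^2)               # 225 + 25 + 25 + 225 = 500
r2_var <- 1 - sse / sst                       # 0.8
```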
# Please write your answer here.
## solution for "my_name" = "Max Hofmann": gear (number of forward gears), qsec (1/4 mile time) and carb (number of carburetors)
# Please write your answer here.
## solution for "my_name" = "Max Hofmann": Predicted mpg decreases with increasing displacement, i.e. fuel consumption rises.
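As a quick plausibility check for the two partial dependence plots, the sketch below looks up which predictors sit at positions 2 and 9 and checks the sign of the raw correlation between displacement and mpg. It assumes that the rows of importance(model) follow the column order of the predictors in mtcars. A raw correlation is not a partial dependence, so this only confirms the direction of the relationship.

```r
predictors <- names(mtcars)[-1]               # all columns except the response mpg
plotted <- predictors[c(2, 9)]                # "disp" and "gear"

disp_mpg_cor <- cor(mtcars$disp, mtcars$mpg)  # negative: larger engines, fewer miles per gallon
print(plotted)
print(disp_mpg_cor)
```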