class exercise 3

In this class we will exercise the Linear Mixed Model, some fundemental concepts of Machine Learning, and the SVM algorithm.

Linear Mixed Models

Load the Arabidopsis data with data(Arabidopsis) (make sure that lme4 is loaded). Attach the dataset (attach(Arabidopsis)). The Arabidopsis dataset describes 625 plants with respect to the the following 8 variables (see ?Arabidopsis). In particular:

reg region: a factor with 3 levels NL (Netherlands), SP (Spain), SW (Sweden)

popu population: a factor with the form n.R representing a population in region R

gen genotype: a factor with 24 (numeric-valued) levels

rack a nuisance factor with 2 levels, one for each of two greenhouse racks

nutrient fertilization treatment/nutrient level (1, minimal nutrients or 8, added nutrients)

amd simulated herbivory or “clipping” (apical meristem damage): unclipped (baseline) or clipped

status a nuisance factor for germination method (Normal, Petri.Plate, or Transplant)

total.fruits total fruit set per plant (integer)

Here is the variable’s univariate distributions:

and here is the dataset structure:

str(Arabidopsis)

## 'data.frame':    625 obs. of  8 variables:
##  $ reg         : Factor w/ 3 levels "NL","SP","SW": 1 1 1 1 1 1 1 1 1 1 ...
##  $ popu        : Factor w/ 9 levels "1.SP","1.SW",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ gen         : int  4 4 4 4 4 4 4 4 4 5 ...
##  $ rack        : int  2 1 1 2 2 2 2 1 2 1 ...
##  $ nutrient    : int  1 1 1 1 8 1 1 1 8 1 ...
##  $ amd         : Factor w/ 2 levels "clipped","unclipped": 1 1 1 1 1 2 1 1 2 1 ...
##  $ status      : Factor w/ 3 levels "Normal","Petri.Plate",..: 3 2 1 1 3 2 1 1 1 2 ...
##  $ total.fruits: int  0 0 0 0 0 0 0 3 2 0 ...

Genotype, greenhouse rack and fertilizer are incorrectly interpreted as quantitative variables. Encode these as categorical. In addition, the distribution of total.fruits is right-skewed, log-transform it (add 1 to aviod \(log(0)\)) to approximate a Gaussian distribution, and verify this.
check the following output:

table(gen, popu)

##     popu
## gen  1.SP 1.SW 2.SW 3.NL 5.NL 5.SP 6.SP 7.SW 8.SP
##   4     0    0    0   31    0    0    0    0    0
##   5     0    0    0   11    0    0    0    0    0
##   6     0    0    0   13    0    0    0    0    0
##   11    0    0    0    0   35    0    0    0    0
##   12    0    0    0    0   26    0    0    0    0
##   13   39    0    0    0    0    0    0    0    0
##   14   26    0    0    0    0    0    0    0    0
##   15   35    0    0    0    0    0    0    0    0
##   16    0    0    0    0    0   43    0    0    0
##   17    0    0    0    0    0   22    0    0    0
##   18    0    0    0    0    0   12    0    0    0
##   19    0    0    0    0    0    0   13    0    0
##   20    0    0    0    0    0    0   24    0    0
##   21    0    0    0    0    0    0   14    0    0
##   22    0    0    0    0    0    0    0    0   13
##   23    0    0    0    0    0    0    0    0   16
##   24    0    0    0    0    0    0    0    0   35
##   25    0   28    0    0    0    0    0    0    0
##   27    0   20    0    0    0    0    0    0    0
##   28    0    0   18    0    0    0    0    0    0
##   30    0    0   14    0    0    0    0    0    0
##   34    0    0    0    0    0    0    0   45    0
##   35    0    0    0    0    0    0    0   47    0
##   36    0    0    0    0    0    0    0   45    0

What can you say about the relation between genotype and region?

3.Let \(X\) be the model matrix of the variables \((1,rack,nutrient,amd,status)\), and \(x\) is a row vector from \(X\). Present \(X\), and \(x_1\), i.e., the first row of \(X\).

Run a linear model assuming \(total.fruits|x \sim N(x'\beta, \sigma^2)\). What might be problematic in this modeling? Present graphical analysis of the residuals of this model that support your answer.
Use lmer and fit LMMs of the type: \[total.fruits|x,u = x'\beta + z'u + \epsilon\] where \(z\) is the random variables vector. assume \(\epsilon\) is iid normaly distributed.

Run the following models (save each model as an object): a. Random intercept for genotype b. Random intercept for population and seperate random intercept for region. c. Random intercept for each region and for each region-population interaction (use nested random effects)

For each of the above models present \(z_1\): the first row vector from model matrix \(Z\). hint: for model a: \(z_1=\)model.matrix(~0 + reg, data = Arabidopsis)[1,]). The goal of this section is to understand the \(z\) component in the formula (\(z\) is a vector of variables).
There is the possibility that the different researchers from the different regions might have handled and fertilized plants differently, thereby exerting slightly different impacts. This may indicate that a random nutrient slope is justified. Based on section 2, what can be an appropriate LMM model? Run it. (you might get a “boundary (singular) fit” messege. ignore it, or read ?isSingular)
Compare all models based on AIC. use anova the same as we did with lm() models.
Plot the density of random genotype effects (intercepts) from the model in section 5. Do they seem Gaussian?

Train, validation, and test sets, and SVM

In this section we will tune an SVM algorithm using e1071 packge on a validation set, to predict if diamond’s price in 10 categories. Load the diamonds dataset (make sure ggplot2 is loaded)

Call d a random subset of 4000 samples from diamonds (we take a subset from computational reasons). divide price into 10 levels (linearly) (use cut()) to generatr a categorical variable; name it price_cat. Split d into 3 sets: training, validation, and test, of ratios 0.5, 0.2, 0.3 respectively. Verify that your sets are of ratios 0.5-0.2-0.3 .
price_cat is the target, all other features except price are predictors. Dont use intercept. Generate the training, validation, and test set model matrices, name them X_trn, X_val, X_tst. (note: R’s defualt coding is “effect coding” this means factors are not cast to dummies (see chapter 7), taht’s ok).
Scale the model matrices, based on the means and sd’s of X_trn (use can use sweep() or caret::preProcess() for example). Why do we scale? why based on X_trn?
Check price distribution on the training set. What kind of prediction type is it?
use e1071::svm to fit an SVM classification algorithm on the training set from the previous question. The behavior of the model is very sensitive to the gamma parameter. Intuitively, the gamma parameter defines how far the influence of a single training example reaches. Read more about SVM parametes here.

Use the validation set to tune gamma. set type="C-classification" and kernel = "polynomial". Check for gammas in seq(0.001,0.5, length= 20). Use a for loop to find the best gamma value. Plot the tunning process (accuracy in the training and the test set as function of gamma). Note: there are better ways to tune models (i.e., without manually looping and fit new model each time. For R implementations, for example you can use the caret or mlr package). Yet, we do so from pedagigical reasons. You shold get somthing like:

What can you say about gamma and overfitting the training set?
Provide confusion matrices and compute the accuracies of your best SVM model on the train, validationm and test sets. Why do they differ?
Why is it important to find gamma on a validation set, and estimate model performance on the test set?