Homework 6

R Markdown

Load the tissuesGeneExpression dataset and standardize the data for illustration as follows:

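library(tissuesGeneExpression)
data(tissuesGeneExpression)        # loads the expression matrix e (genes x samples)
set.seed(1)
ind <- sample(nrow(e), 500)        # 500 randomly chosen genes
Y <- t(apply(e[ind, ], 1, scale))  # standardize each gene (row) across samples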

1. Compute the singular value decomposition of Y using svd and store the result as s. This should return a list of 3 elements: d, u, and v. Why does s$u %*% s$d %*% t(s$v) return an error? (Hint: in the written form of the SVD, U should be an m x p orthogonal matrix, V should be a p x p orthogonal matrix, and D should be a p x p diagonal matrix.)

2. Fix the formatting of these matrices so that you can compute s$u %*% s$d %*% t(s$v). Compute the maximum residual between your result and Y. Is this residual small or large? Does this make sense? Display your code.
3. Plot the percent variability explained by each column of D. How many dimensions do we start with? How many dimensions do we need to explain 90% of the variability in Y? Display your code.
4. Load the GSE5859Subset dataset and compute the singular value decomposition of the data with svd. Then calculate U^TY as a new matrix z. Using the cor() function, find the column of z with the highest correlation to sampleInfo$group. Display your code.
5. In class, we discussed the concept of time-related batch effects, which produce separations among samples that should theoretically cluster together. Look at the values of sampleInfo$group. How many discrete values are there? In this dataset, you can use sampleInfo$group to separate clusters. Given what you know about batch effects and the discrete values of sampleInfo$group, what do you think sampleInfo$group represents?
6. So far, all of our dimension reduction examples using Euclidean distance calculations have worked well. Which of the following types of data can NEVER be reduced using SVD/PCA?

a) Quantitative variables with linear relationships to one another
b) Quantitative variables with nonlinear relationships to one another
c) Categorical variables, always
d) Quantitative variables, always
e) a and b
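For the SVD questions above, the following is a minimal sketch on a small simulated matrix, not the homework data. It shows how svd() stores the singular values, one way to rebuild the original matrix, and one way to summarize variability per component. The names A, A_hat, max_resid, var_explained, and k90 are placeholders introduced here, and measuring variability by the squared singular values is an assumption.

```r
# Toy example on simulated data (not the homework matrix Y)
set.seed(1)
A <- matrix(rnorm(50 * 8), nrow = 50, ncol = 8)
s <- svd(A)

# svd() returns the singular values as a plain numeric vector
str(s$d)

# Expand d into an 8 x 8 diagonal matrix before multiplying
A_hat <- s$u %*% diag(s$d) %*% t(s$v)
max_resid <- max(abs(A - A_hat))
print(max_resid)

# Share of total variability per component, assuming it is
# measured by the squared singular values
var_explained <- s$d^2 / sum(s$d^2)
cum_explained <- cumsum(var_explained)
plot(100 * var_explained, type = "b",
     xlab = "Component", ylab = "Percent variability explained")

# Smallest number of components whose cumulative share reaches 90%
k90 <- which(cum_explained >= 0.9)[1]
print(k90)
```

The same pattern should carry over to Y, although the dimensions and the number of components needed will differ.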

Answer questions 7-9 using the scatter plot below that contains a black line representing a decision boundary.

7. Given the location of the decision boundary (the vertical black line), what is the maximum accuracy a classifier attempting to classify points as belonging to group 1 or group 2 could achieve? (hint: accuracy has a precise definition for binary classification tasks)
8. Suppose you can move the decision boundary by shifting it left or right and rotating it. What is the maximum achievable accuracy in this case?
9. The decision boundary shown on the plot is linear. Suppose you can change the decision boundary to be any non-linear function. What is the maximum achievable accuracy now? (hint: you do not need to provide the definition of the function you would like to use; just state the best achievable result)
10. In this multi-part question, you will train a linear classifier for a binary classification task. Recall that in class we used linear models to perform regression, then tested the predictive ability of the fitted model on a test dataset that was not used for training. Here, we will also use a linear model, but our task will instead be classification. Rather than using the lm() function, which can only handle regression tasks, we will use glm() (generalized linear model), which adapts linear models for classification tasks. The process of training and testing the model is identical to what we did in class, except that the labels are discrete classes rather than continuous values.

For this question, you will need to download train_linear_classifier.csv and test_linear_classifier from https://github.com/gregmedlock/datascience_teaching. Read in the files so that they contain three columns named “x”, “y”, and “class”. Here, x and y are continuous values that you will use to predict class, which is either 1 or 2.

10a. Using the glm() function, train a linear model that predicts class using x and y as predictor variables. Within the glm() function, you should specify family=binomial(link='logit'). Hint: for classification problems, it is usually necessary to convert the column containing the class label to a factor, e.g. df$class = as.factor(df$class). This is especially true in our case, where the class is specified by 1 or 2; failing to convert to a factor may result in glm() performing regression rather than classification. After fitting the model to the training data, what are the coefficients for x and y? What are the standard errors? Does the model summary indicate that these variables are “significant”? Display your code.

10b. Using the glm model you fit for 10a, predict the class of samples in the test set using “x” and “y” as input. In the predict() function, you should indicate type="response", and the output will be a probability. In this case, the probability will range from 0 to 1, where 1 indicates that the sample belongs to class 2, and 0 indicates that the sample belongs to class 1. Using a threshold of 0.5 (e.g. if probability is >0.5, the sample belongs to class 2), calculate the accuracy of your predictions for the test set. What value do you get? Display your code.
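The fit, predict, and threshold workflow described in 10a and 10b can be sketched as follows on simulated data. The helper make_data() and the data it generates are hypothetical and stand in for the homework CSV files, so the accuracy it prints says nothing about the real dataset.

```r
set.seed(1)

# Simulated two-class data (illustration only); class 2 is shifted relative to class 1
make_data <- function(n) {
  cls <- sample(1:2, n, replace = TRUE)
  data.frame(x = rnorm(n, mean = cls),
             y = rnorm(n, mean = -cls),
             class = as.factor(cls))
}
train <- make_data(200)
test <- make_data(200)

fit <- glm(class ~ x + y, data = train, family = binomial(link = "logit"))

# With a factor response, glm() models the probability of the second level ("2")
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "2", "1")

# Accuracy: fraction of test samples whose predicted class matches the label
accuracy <- mean(pred == as.character(test$class))
print(accuracy)
```

Comparing labels as character strings avoids relying on how R coerces factors to numbers.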

10c. We can improve the model further. For this part of the question, we will explore how non-linearity can greatly improve model performance, even if we still use the same underlying model. Your task for 10c and 10d is to conduct “feature engineering”, which is the process of manually creating a new predictor variable (i.e. feature) that is a function of other predictor variables in the dataset. After creating the feature, you then use that feature in training and testing, with the hope that it is a better predictor than any of the features on their own. Some examples of this for our specific dataset and task would be introducing a third variable that you construct with some of the following transformations: x^2, y^2, 2^x, x + y, x - y, x*y, sqrt(x), etc. First, plot the training dataset as a scatter plot and color the points by class. Based on this plot, what is the shape of the decision boundary that would perform better than a straight line? What are some examples of mathematical transformations that you could apply to x and y so that the classes are easier to separate with a straight line? Plotting the sample values against the example transformations above might help with this (e.g. plot x vs. 2^x and see if a linear decision boundary looks like it will do well). Feel free to communicate your answer to the first part of this question by drawing the decision boundary on the plot. There is unlikely to be a simple function that allows a perfect decision boundary, but you should be able to improve on the inferred boundary from 10b by quite a bit.

10d. Add your proposed feature from 10c to the training and test data (hint: the feature you choose should be a function of x and/or y; for each sample in both the training and test sets, you will compute a value for the new feature using the sample’s values of x and/or y). Train a glm() model as before, but use your new “engineered” dataset. Generate predictions for the test set using the new model. What is the accuracy of your model, using a 0.50 probability threshold as we did in 10b? Hint: this model is likely to fit much better than the previous one, but may raise warnings about predicted probabilities of 0 or 1. Don’t worry about these, since they occur when the model separates the classes almost perfectly.
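As a hedged illustration of the mechanics in 10d, the sketch below adds one engineered column to simulated training and test sets and compares accuracy with and without it. The ring-shaped class rule, the helper functions make_ring() and accuracy(), and the feature name r2 are all assumptions chosen for the example. They are not meant to describe the homework data or to suggest which feature works best there.

```r
set.seed(1)

# Simulated data (illustration only): class 2 tends to lie outside a circle
make_ring <- function(n) {
  x <- runif(n, -2, 2)
  y <- runif(n, -2, 2)
  noisy_r2 <- x^2 + y^2 + rnorm(n, sd = 0.5)   # label noise avoids perfect separation
  data.frame(x = x, y = y, class = as.factor(ifelse(noisy_r2 > 2, 2, 1)))
}
train <- make_ring(500)
test <- make_ring(500)

# Accuracy at a 0.5 probability threshold
accuracy <- function(fit, newdata) {
  prob <- predict(fit, newdata = newdata, type = "response")
  pred <- ifelse(prob > 0.5, "2", "1")
  mean(pred == as.character(newdata$class))
}

# Baseline: linear in x and y only
fit_lin <- glm(class ~ x + y, data = train, family = binomial(link = "logit"))

# Engineered feature, computed the same way for the training and test sets
train$r2 <- train$x^2 + train$y^2
test$r2 <- test$x^2 + test$y^2
fit_eng <- glm(class ~ x + y + r2, data = train, family = binomial(link = "logit"))

acc_lin <- accuracy(fit_lin, test)
acc_eng <- accuracy(fit_eng, test)
print(c(linear = acc_lin, engineered = acc_eng))
```

The model is still linear in its inputs; only the inputs have changed, which is why the same glm() call can produce a curved boundary in the original x and y coordinates.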

10e. Train a random forest classifier on the original training dataset that does not include the “engineered” feature. Use the randomForest package for this. What is the out-of-bag estimate of the error rate? (Hint: this is reported by the randomForest() function; you do not need to use the test set for this problem.) Why do you think the random forest performs so much better than the linear model?
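One way to read the out-of-bag error from a fitted forest is sketched below, assuming the randomForest package is installed. The simulated data and the choice of ntree = 200 are assumptions for illustration only.

```r
# Runs only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 300
  x <- runif(n, -2, 2)
  y <- runif(n, -2, 2)

  # Simulated non-linear class rule (illustration only)
  train <- data.frame(x = x, y = y,
                      class = as.factor(ifelse(x^2 + y^2 > 2, 2, 1)))

  rf <- randomForest::randomForest(class ~ x + y, data = train, ntree = 200)

  # err.rate has one row per tree; the last row is the final OOB estimate
  oob_error <- rf$err.rate[rf$ntree, "OOB"]
  print(oob_error)
}
```

Printing the fitted object (print(rf)) also reports the OOB estimate together with a confusion matrix.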


Lee Talman and Greg Medlock

November 15, 2017
