The source code for this notebook can be found on GitHub.

library(tidyverse)
library(itertools)
library(doParallel)  # also attaches parallel, which provides mclapply() and detectCores()

Exercise 1

One experiment amounts to flipping each of n_coins coins n_flips times.

experiment <- function(n_coins, n_flips) {
  # Each row is one coin, each column one flip; TRUE counts as heads.
  X <- matrix(
    runif(n_coins * n_flips) < 0.5, 
    ncol = n_flips, 
    nrow = n_coins
  )
  # Fraction of heads per coin.
  fractions <- rowSums(X) / n_flips
  list(
    v_1 = fractions[1],             # the first coin
    v_rand = sample(fractions, 1),  # a coin picked at random
    v_min = min(fractions)          # the coin with the fewest heads
  )
}

Let’s flip away!

X_stats <- mclapply(
  1:100000,
  function(x) experiment(n_coins = 1000, n_flips = 10),
  mc.cores = detectCores()
) %>% bind_rows
  
X_stats %>% 
  summarise_all(mean)

Answer \(b\).

Exercise 2

Coins \(c_1\) and \(c_{rand}\) are equivalent, and the distributions of their sample averages both satisfy Hoeffding’s single-bin inequality. \(c_{min}\) has a different distribution (\(\min\{C_1, \cdots, C_{1000}\}/10\) where \(C_i \sim B(10, 0.5)\)) and therefore does not satisfy the single-bin inequality (or rather, it does, but around the mean of that different distribution).
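As a quick sanity check (a sketch, assuming X_stats from Exercise 1 is still in scope), we can compare the empirical tail probabilities \(P\left[|\nu - 0.5| > \epsilon\right]\) with the single-bin Hoeffding bound \(2e^{-2\epsilon^2 N}\) for \(N = 10\) flips: v_1 and v_rand should come in under the bound, while v_min should not.

eps <- 0.3
X_stats %>%
  summarise_all(~ mean(abs(. - 0.5) > eps)) %>%
  mutate(hoeffding_bound = 2 * exp(-2 * eps^2 * 10))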

Therefore, the answer is \(d\).

Exercise 3

Let us denote the noisy version of \(f\) by \(f^{\prime}\), and recall that \(\mu = P\left[h(x) \neq f(x)\right]\) while \(\lambda = P\left[f^{\prime}(x) = f(x)\right]\). The two ways in which \(h\) and \(f^{\prime}\) can disagree are disjoint events, so \(P(A \cup B) = P(A) + P(B)\); and since \(f^{\prime}\) and \(h\) make their mistakes independently, each term factors into a product:

\[ \begin{aligned} P\left[h(x) \neq f^{\prime}(x)\right] &= P\left[f^{\prime}(x) = f(x) \cap h(x) \neq f(x)\right] + P\left[f^{\prime}(x) \neq f(x) \cap h(x) = f(x)\right] \\ &= \lambda \cdot \mu + (1 - \lambda) \cdot (1 - \mu) \end{aligned} \]

This gives us alternative \(e\).
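A quick Monte Carlo check of this formula (a sketch, with arbitrarily chosen \(\mu\) and \(\lambda\)):

local({
  n <- 1e6; mu <- 0.3; lambda <- 0.8
  f_true <- rep(1, n)                                       # true labels (all +1, WLOG)
  h <- ifelse(runif(n) < mu, -f_true, f_true)               # h errs with probability mu
  f_noisy <- ifelse(runif(n) < 1 - lambda, -f_true, f_true) # noise flips f with probability 1 - lambda
  c(empirical = mean(h != f_noisy),
    formula = mu * lambda + (1 - mu) * (1 - lambda))
})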

Exercise 4

When \(\lambda = 1/2\), we get that \(\lambda = 1 - \lambda\), so that:

\[ P\left[h(x) \neq f^{\prime}(x)\right] = \mu \cdot \lambda + (1 - \mu) \cdot \lambda = \lambda \]

This gives us alternative \(b\): at \(\lambda = 1/2\), the performance of \(h\) no longer depends on \(\mu\).

Exercises 5-7: Linear Regression for Classification

We’ll reuse the code for generating linearly separable datasets from Homework 1.

plot_dataset(generate_ls_dataset(500))

Now we write down the ordinary least-squares estimator, which we will use throughout these exercises.

olsr <- function(X, y) {
  # Closed-form OLS solution via the normal equations: (X'X)^-1 X'y.
  solve(t(X) %*% X) %*% t(X) %*% y
}
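Solving the normal equations explicitly can be numerically fragile when \(X^\top X\) is ill-conditioned. As a sketch (olsr_qr is just an illustrative name), base R can compute the same least-squares solution through a QR decomposition:

olsr_qr <- function(X, y) {
  # Least-squares solution via QR; equivalent to olsr() on well-behaved inputs.
  qr.solve(X, y)
}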

Let’s check that olsr works by estimating the parameters of the noisy function \(f(x) = 3 + 2x + \epsilon\), where \(\epsilon\) is Gaussian noise with standard deviation \(4\).

local({
  # rnorm() takes the number of draws from length(x) here.
  f <- function(x) 3 + 2*x + rnorm(length(x), sd = 4)
  X <- seq(1, 100, 0.01)
  y <- f(X)
  olsr(cbind(1, X), y)
})
##       [,1]
##   2.978704
## X 2.000670

And that’s a pretty decent estimate.

Finally, before running the experiments, let’s have a look at a hypothesis generated by our linear regression algorithm on a randomly generated dataset.

D <- generate_ls_dataset(500)
plot_olsr(D, olsr(get_x(D$X), D$z))

Again, this looks good, so we are good to go.

Exercise 5

summary(
  sapply(
    1:1000, 
    function(x) {
      D <- generate_ls_dataset(100)
      # In-sample classification error of the regression fit.
      sum(f(D$X, olsr(get_x(D$X), D$z)) != f(D$X, D$w)) / 100
    }
  ) 
)

And we get answer \(d\) (mean is closer to \(0.1\) than to \(0.01\)). Would we get different results by using R’s lm function?

olsr_lm <- function(X, y) {
  # Drop the bias column; lm() adds its own intercept.
  x <- X[,-1]
  lm(y ~ x)$coefficients
}
summary(
  sapply(
    1:1000, 
    function(x) {
      D <- generate_ls_dataset(100)
      sum(f(D$X, olsr_lm(get_x(D$X), D$z)) != f(D$X, D$w)) / 100
    }
  )
)

Nope, pretty much the same.

Exercise 6

As in Homework 1, we will shoot points at the board and see what fraction of them land in the region of disagreement between \(f\) (the reference function) and \(g\) (the approximation).

error_sim <- function(t, g, n, project = identity, .f = f) {
  # Monte Carlo estimate of the disagreement between weight vectors t and g.
  X <- project(generate_points(n, -1, 1))
  sum(.f(X, t) != .f(X, g)) / n
}
summary(
  unlist(mclapply(
    1:100, 
    function(x) {
      D <- generate_ls_dataset(100)
      error_sim(D$w, olsr(get_x(D$X), D$z), n = 1000000)
    },
    mc.cores = 4
  ))
)

And we still get answer \(d\). Are we looking at the right thing? Let’s plot the area we are looking at in orange.

local({
  # Fit line by regression.
  D <- generate_ls_dataset(100)
  g <- olsr(get_x(D$X), D$z)

  # Out-of-sample points.
  D_out <- generate_ls_dataset(10000, w = D$w)
  
  plot_olsr(D_out, w_final = g)
  
  # What we are measuring.
  i_diff <- f(D_out$X, D$w) != f(D_out$X, g)
  
  x_diff <- get_x(D_out$X[i_diff,])[,-1]
  y_diff <- get_y(D_out$X[i_diff,])

  points(y_diff ~ x_diff, col = 'orange', pch = 16)
})

It appears we are looking at the right area, and that our points are randomized just fine. It also seems that \(E_{in}\) tracks \(E_{out}\) quite well, as expected.

I cannot, therefore, pinpoint why my answer does not match with the answer in the published answers.

Exercise 7

We will need the perceptron again, so here it is:

pla_h <- function(X, w) {
  # Perceptron hypothesis: the sign of the signal.
  sign(X %*% w)
}

pla_it <- function(X, z, w) {
  # One PLA iteration: pick a misclassified point at random and update.
  k <- which(pla_h(X, w) != z)
  if (is_empty(k)) {
    return(NULL)
  }
  k <- k[sample.int(length(k), 1)]
  w + z[k] * X[k,]
}

pla <- function(X, z, n = Inf, w = NULL) {
  # Run PLA for at most n iterations, returning the history of weight vectors.
  w <- if (is.null(w)) rep(0, dim(X)[2]) else w
  c(list(as.list(w)), {
    w <- pla_it(X, z, w)
    if (!is.null(w) && n > 0) {
      pla(X, z, n - 1, w)
    }
  })
}

pla_line <- function(w) {
  # Convert perceptron weights to the (intercept, slope) of the boundary line.
  coef <- c(intercept = -w[1]/w[3], slope = -w[2]/w[3])
  coef[is.nan(coef)] <- 0
  coef
}
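Before reusing it, a quick sanity check (a sketch, assuming D$X carries the bias column as in the Homework 1 helpers): on a small separable dataset, PLA should converge and its final weights should classify every point correctly.

local({
  D <- generate_ls_dataset(20)
  w_hist <- pla(D$X, D$z)
  w_hat <- unlist(tail(w_hist, 1)[[1]])  # final weight vector
  c(updates = length(w_hist) - 1,
    in_sample_errors = sum(pla_h(D$X, w_hat) != D$z))
})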

Recall that, in two dimensions, our perceptron classifies points by the sign of the signal: \[ s({\mathbf{x}}, {\mathbf{w}}) = w_0 + w_1 \cdot x + w_2 \cdot y \]

To see how to set the initial weights, note that the regression line \(y = a \cdot x + b\) can be rewritten as:

\[ b + a \cdot x - y = 0 \]

We therefore set \(w_0 = b\), \(w_1 = a\), and \(w_2 = -1\).
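As a consistency check on this mapping, pla_line applied to such a weight vector should recover the regression line’s intercept and slope; for example, with \(b = 3\) and \(a = 2\):

pla_line(c(3, 2, -1))  # should recover intercept = 3, slope = 2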

summary(
  sapply(
    1:10000,
    function(x) {
      D <- generate_ls_dataset(10)
      # pla() also returns the initial weights, so subtract 1 to count updates.
      length(pla(D$X, D$z, w = c(olsr(get_x(D$X), D$z), -1))) - 1
    }
  )
)

This gets us answer \(a\): the mean is closer to \(1\) than it is to \(15\).

Exercises 8-10: Non-linear Transformation

For the next exercises we will be using the noisy target function: \[ f(x_1, x_2) = X \cdot \text{sign}\,(x_1^2 + x_2^2 - 0.6) \]

where \(X \in \{-1, 1\}\) is a Bernoulli random variable with parameter \(1/10\), transformed so that it yields \(-1\) for success and \(1\) for failure. Let us start by writing that down and by having a look at a dataset.

nlf <- function(X, w) {
  # Circular target: +1 outside the circle of radius sqrt(0.6), -1 inside.
  sign(rowSums(X[,2:ncol(X)]**2) - 0.6)
}

nlf_noisy <- function(X, w) {
  # Flip the sign of a random 10% of the labels.
  z <- nlf(X)
  z_i <- sample(length(z))[1:(length(z) * .1)]
  z[z_i] <- -z[z_i]
  z
}

nlf_surface <- function(w) {
  r <- sqrt(0.6)
  x <- seq(-r, r, 0.01)
  y <- sapply(x, function(x) sqrt(r**2 - x**2))
  x <- c(x, rev(x))
  y <- c(y, -rev(y))
  lines(y ~ x, col = 'red', lty = 2, lwd = 4)
}
plot_dataset(generate_ls_dataset(n = 10000, target = nlf_noisy), surface = nlf_surface) 

Exercise 8

summary(
  sapply(
    1:1000, 
    function(x) {
      D <- generate_ls_dataset(100, target = nlf_noisy)
      sum(f(D$X, olsr(get_x(D$X), D$z)) != D$z) / 100
    }
  )
)

And that puts us at \(50\%\) error (as good as flipping a coin!), or answer \(d\).

Exercise 9

We transform the features as prescribed.

nl_features <- function(X, Y) {
  # Feature map: (1, x1, x2, x1*x2, x1^2, x2^2).
  cbind(1, X, Y, X*Y, X**2, Y**2)
}
# Average the fitted weights over 10000 independent datasets.
w <- (Reduce(`+`,
  lapply(
    1:10000, 
    function(x) {
      D <- generate_ls_dataset(1000, target = nlf_noisy)
      olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)
    }
  )
)/10000) %>% t %>% as.vector

names(w) <- sprintf('w_%d', 0:5)
round(w, digits = 4) %>% t %>% as_tibble

And we get, by a somewhat ugly stretch, alternative \(a\). Though our weights are somewhat different, it does make sense that we assign close-to-zero weights to the \(x_1\), \(x_2\), and \(x_1 x_2\) features (\(w_1\), \(w_2\), \(w_3\)), as those are not relevant to recovering the original surface. Indeed, we only need the bias, \(x_1^2\), and \(x_2^2\) terms (\(w_0\), \(w_4\), \(w_5\)) and, remarkably, our weights on these features are almost the same as the coefficients of the original surface, \((-0.6, 1, 1)\), up to a common scale factor of roughly \(|w_0|/0.6\).
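To make that comparison concrete (a sketch, reusing the averaged w from above), we can rescale the weights so that the bias matches the original surface’s \(-0.6\); the \(x_1^2\) and \(x_2^2\) entries should then land close to \(1\) and the remaining ones close to \(0\).

# Rescale so that the bias term equals the original surface's -0.6.
round(w * (-0.6 / w['w_0']), digits = 2)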

Let’s try to plot the decision surface. Since inverting (i.e., solving for \(y\)) the equation for the surface algebraically is a bit of a pain, we will settle for a numeric contour:

plot_olsr_nl <- function(D, w_final, features, x_min, x_max, step) {
  y <- x <- seq(x_min, x_max, step)
  contour(
    x, y, outer(x, y, function(x, y) features(x, y) %*% w_final),
    add = TRUE,
    levels = c(0),
    col = 'green',
    lwd = 2
  )
}
D <- generate_ls_dataset(5000, target = nlf_noisy)
w <- olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)

plot_dataset(D, surface = nlf_surface) 
plot_olsr_nl(D, w, nl_features, -1, 1, 0.001)

What happens if we try to fit a dataset that is linearly separable in two dimensions?

local({
  D <- generate_ls_dataset(5000)
  w <- olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)

  plot_dataset(D) 
  plot_olsr_nl(D, w, nl_features, -1, 1, 0.001)
})

So, rather nicely, it tries to stretch the circle enough that it follows the straight line. And, assuming that \(\mathcal{X} = \left[-1, 1 \right]^2\) and that no out-of-sample points fall outside those boundaries (i.e., that our sample is unbiased), this is actually a good solution.
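We can put a rough number on that claim (a sketch, reusing this notebook’s helpers): fit the six-feature model on a linearly separable dataset and measure its disagreement with the true linear target on a fresh out-of-sample draw.

local({
  D <- generate_ls_dataset(5000)
  w_nl <- olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)
  D_out <- generate_ls_dataset(10000, w = D$w)  # same target, fresh points
  mean(D_out$z != general_f(
    nl_features(get_x(D_out$X, no_bias = TRUE), get_y(D_out$X)), w_nl
  ))
})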

Exercise 10

Just by looking at the curve fitted by linear regression in Exercise 9, I would expect us to get around \(10\%\) of the classes wrong. And this would indeed be great, as it would mean that our \(E_{out}\) is essentially all noise.

summary(
  unlist(mclapply(
    1:100,
    function(x) {
      D_test <- generate_ls_dataset(10000, target = nlf_noisy)
      # Out-of-sample error: fraction of the 10000 test points misclassified.
      mean(D_test$z != general_f(
        nl_features(get_x(D_test$X, no_bias = TRUE), get_y(D_test$X)), w
      ))
    },
    mc.cores = 4
  ))
)

Well, the mean error is close to \(12\%\), so around \(2\%\) above the noise level. This nevertheless corresponds to alternative \(b\).