The source code for this notebook can be found on GitHub.
library(tidyverse)
library(itertools)
library(parallel)    # mclapply() and detectCores(), used below
library(doParallel)
One experiment amounts to flipping n_coins coins, n_flips times each.
experiment <- function(n_coins, n_flips) {
  # One row per coin, one column per flip; TRUE counts as heads.
  X <- matrix(
    runif(n_coins * n_flips) < 0.5,
    ncol = n_flips,
    nrow = n_coins
  )
  fractions <- rowSums(X) / n_flips
  list(
    v_1 = fractions[1],             # the first coin
    v_rand = sample(fractions, 1),  # a coin chosen at random
    v_min = min(fractions)          # the coin with the fewest heads
  )
}
Let’s flip away!
X_stats <- mclapply(
1:100000,
function(x) experiment(n_coins = 1000, n_flips = 10),
mc.cores = detectCores()
) %>% bind_rows
X_stats %>%
summarise_all(mean)
Answer \(b\).
Coins \(c_1\) and \(c_{rand}\) are equivalent, and the distributions of their sample averages both satisfy Hoeffding’s single-bin inequality. \(c_{min}\) has a different distribution (its head count is \(\min\{C_1, \cdots, C_{1000}\}\) with \(C_i \sim B(10, 0.5)\)) and therefore does not satisfy the single-bin inequality; or rather, it does, but for a different distribution.
Therefore, the answer is \(d\).
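As a quick extra check (not part of the exercise), we can reuse X_stats to compare the empirical probability of a large deviation from \(\mu = 0.5\) against the single-bin Hoeffding bound \(2e^{-2\epsilon^2 N}\), here with \(N = 10\) flips and an illustrative \(\epsilon = 0.3\):
local({
  eps <- 0.3
  n_flips <- 10
  hoeffding_bound <- 2 * exp(-2 * eps^2 * n_flips)
  X_stats %>%
    summarise_all(~ mean(abs(. - 0.5) > eps)) %>%
    mutate(bound = hoeffding_bound)
})
The deviation frequencies for \(v_1\) and \(v_{rand}\) should sit below the bound, while the one for \(v_{min}\) should exceed it by a wide margin.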
Let us denote the noisy version of \(f\) by \(f^{\prime}\). The event \(h(x) \neq f^{\prime}(x)\) happens exactly when the noise keeps the label but \(h\) gets it wrong, or the noise flips the label but \(h\) gets it right. These two events are disjoint, so their probabilities add; and since \(f^{\prime}\) and \(h\) make their mistakes independently, each joint probability factors into a product:
\[ P\left[h(x) \neq f^{\prime}(x)\right] = P\left[f^{\prime}(x) = f(x) \cap h(x) \neq f(x)\right] + P\left[f^{\prime}(x) \neq f(x) \cap h(x) = f(x)\right] = \mu\cdot\lambda + (1 - \mu)\cdot(1 - \lambda) \] which gives us alternative \(e\).
When \(\lambda = 1/2\), we have \(\lambda = 1 - \lambda\), so that:
\[ P\left[h(x) \neq f^{\prime}(x)\right] = \mu \cdot \lambda + (1 - \mu) \cdot \lambda = \lambda = \frac{1}{2} \] which gives us alternative \(b\): the performance of \(h\) becomes independent of \(\mu\).
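A quick Monte Carlo check of both results (an extra sketch, plain base R, no course helpers assumed): flip a clean target’s labels with probability \(1 - \lambda\), let \(h\) disagree with \(f\) with probability \(\mu\), and measure how often \(h\) disagrees with the noisy labels.
noisy_error <- function(mu, lambda, n = 100000) {
  f <- sample(c(-1, 1), n, replace = TRUE)          # clean target values
  f_noisy <- f * ifelse(runif(n) < lambda, 1, -1)   # label kept with probability lambda
  h <- f * ifelse(runif(n) < mu, -1, 1)             # h errs with probability mu
  mean(h != f_noisy)
}
c(
  formula = 0.8 * 0.3 + (1 - 0.8) * (1 - 0.3),      # mu = 0.8, lambda = 0.3
  simulated = noisy_error(mu = 0.8, lambda = 0.3),
  lambda_half = noisy_error(mu = 0.8, lambda = 0.5) # ~0.5 regardless of mu
)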
We’ll use the same code for generating linearly separable datasets as in Homework 1.
plot_dataset(generate_ls_dataset(500))
Now we write down the formula for ordinary least-squares regression, \(\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}\), which we will use later.
olsr <- function(X, y) {
  # Normal equations: w = (X'X)^{-1} X'y.
  solve(t(X) %*% X) %*% t(X) %*% y
}
Let’s check that it works by estimating the parameters of the noisy function \(f(x) = 3 + 2x + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, 4^2)\) (standard deviation 4).
local({
  f <- function(x) 3 + 2*x + rnorm(length(x), sd = 4)
  X <- seq(1, 100, 0.01)
  y <- f(X)
  olsr(cbind(1, X), y)
})
## [,1]
## 2.978704
## X 2.000670
And that’s a pretty decent estimate.
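As an aside: if numerical stability ever became a concern, an equivalent formulation solves the normal equations directly instead of inverting \(X^T X\) explicitly. This is a minor variation, not needed for these exercises, and the name olsr_stable is just ours:
olsr_stable <- function(X, y) {
  # Same estimator, but solve (X'X) w = X'y rather than forming the inverse.
  solve(crossprod(X), crossprod(X, y))
}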
Finally, before running the experiments, let’s have a look at a hypothesis generated by our linear regression algorithm on a randomly generated dataset.
D <- generate_ls_dataset(500)
plot_olsr(D, olsr(get_x(D$X), D$z))
Again, this looks good, so we are good to go.
summary(
  unlist(lapply(
    1:1000,
    function(x) {
      D <- generate_ls_dataset(100)
      sum(f(D$X, olsr(get_x(D$X), D$z)) != f(D$X, D$w)) / 100
    }
  ))
)
And we get answer \(d\) (the mean is closer to \(0.1\) than to \(0.01\)). Would we get different results by using R’s lm function?
olsr_lm <- function(X, y) {
  # Drop our explicit bias column; lm() adds its own intercept.
  x <- X[,-1]
  lm(y ~ x)$coefficients
}
summary(
  unlist(lapply(
    1:1000,
    function(x) {
      D <- generate_ls_dataset(100)
      sum(f(D$X, olsr_lm(get_x(D$X), D$z)) != f(D$X, D$w)) / 100
    }
  ))
)
Nope, pretty much the same.
As in Homework 1, we will shoot points at the board and see what fraction of them land in the area between \(f\) (the reference function) and \(g\) (the approximation).
error_sim <- function(t, g, n, project = identity, .f = f) {
X <- project(generate_points(n, -1, 1))
sum(.f(X, t) != .f(X, g)) / n
}
summary(
  unlist(mclapply(
    1:100,
    function(x) {
      D <- generate_ls_dataset(100)
      error_sim(D$w, olsr(get_x(D$X), D$z), n = 1000000)
    },
    mc.cores = 4
  ))
)
And we still get answer \(d\). Are we looking at the right thing? Let’s plot the area we are looking at in orange.
local({
# Fit line by regression.
D <- generate_ls_dataset(100)
g <- olsr(get_x(D$X), D$z)
# Out-of-sample points.
D_out <- generate_ls_dataset(10000, w = D$w)
plot_olsr(D_out, w_final = g)
# What we are measuring.
i_diff <- f(D_out$X, D$w) != f(D_out$X, g)
x_diff <- get_x(D_out$X[i_diff,])[,-1]
y_diff <- get_y(D_out$X[i_diff,])
points(y_diff ~ x_diff, col = 'orange', pch = 16)
})
It appears we are looking at the right area, and that our points are randomized just fine. It also seems that \(E_{in}\) tracks \(E_{out}\) quite well, as expected.
I cannot, therefore, pinpoint why my answer does not match the published answer key.
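To make the \(E_{in}\) versus \(E_{out}\) comparison explicit, here is a small extra check (using only the helpers defined above): fit one hypothesis per run and estimate both errors from the same fit.
local({
  runs <- lapply(1:100, function(x) {
    D <- generate_ls_dataset(100)
    g <- olsr(get_x(D$X), D$z)
    c(
      e_in = sum(f(D$X, g) != f(D$X, D$w)) / 100,  # in-sample disagreement
      e_out = error_sim(D$w, g, n = 100000)        # disagreement on fresh points
    )
  })
  colMeans(do.call(rbind, runs))
})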
We will need the perceptron again, so here it is:
pla_h <- function(X, w) {
  sign(X %*% w)
}
pla_it <- function(X, z, w) {
  # Indices of the points the current weights misclassify.
  k <- which(pla_h(X, w) != z)
  if (is_empty(k)) {
    return(NULL)
  }
  # Pick one misclassified point at random and update the weights with it.
  k <- k[sample.int(length(k), 1)]
  w + z[k] * X[k,]
}
pla <- function(X, z, n = Inf, w = NULL) {
  w <- if (is.null(w)) rep(0, dim(X)[2]) else w
  # Accumulate the sequence of weight vectors until convergence (or n updates).
  c(list(as.list(w)), {
    w <- pla_it(X, z, w)
    if (!is.null(w) && n > 0) {
      pla(X, z, n - 1, w)
    }
  })
}
pla_line <- function(w) {
  # Convert weights into the (intercept, slope) of the decision boundary.
  coef <- c(intercept = -w[1]/w[3], slope = -w[2]/w[3])
  coef[is.nan(coef)] <- 0
  coef
}
Recall that, in two dimensions, our perceptron classifies points according to the sign of: \[ w_0 + w_1 \cdot x + w_2 \cdot y \]
To understand how to set the initial weights, note that the regression output describes a line of the form:
\[ b + a \cdot x - y = 0 \]
We therefore set \(w_0 = b\), \(w_1 = a\), and \(w_2 = -1\).
summary(
  unlist(lapply(
    1:10000,
    function(x) {
      D <- generate_ls_dataset(10)
      length(pla(D$X, D$z, w = c(olsr(get_x(D$X), D$z), -1)))
    }
  ))
)
This gets us answer \(a\): the mean is closer to \(1\) than it is to \(15\).
For the next exercises we will be using the noisy target function: \[ f(x_1, x_2) = X \cdot \text{sign}\,(x_1^2 + x_2^2 - 0.6) \]
where \(X \in \{-1, 1\}\) is a Bernoulli random variable with parameter \(1/10\), transformed so that it yields \(-1\) on success and \(1\) on failure; in other words, each label is flipped with probability \(1/10\). Let us start by writing that down and having a look at a dataset.
nlf <- function(X, w) {
  # Deterministic part of the target: sign(x_1^2 + x_2^2 - 0.6).
  sign(rowSums(X[,2:ncol(X)]**2) - 0.6)
}
nlf_noisy <- function(X, w) {
  z <- nlf(X)
  # Flip the sign of a randomly chosen 10% of the labels.
  z_i <- sample(length(z))[1:(length(z) * .1)]
  z[z_i] <- -z[z_i]
  z
}
nlf_surface <- function(w) {
r <- sqrt(0.6)
x <- seq(-r, r, 0.01)
y <- sapply(x, function(x) sqrt(r**2 - x**2))
x <- c(x, rev(x))
y <- c(y, -rev(y))
lines(y ~ x, col = 'red', lty = 2, lwd = 4)
}
plot_dataset(generate_ls_dataset(n = 10000, target = nlf_noisy), surface = nlf_surface)
summary(
  unlist(lapply(
    1:1000,
    function(x) {
      D <- generate_ls_dataset(100, target = nlf_noisy)
      sum(f(D$X, olsr(get_x(D$X), D$z)) != D$z) / 100
    }
  ))
)
And that puts us at \(50\%\) error (as good as flipping a coin!), or answer \(d\).
We transform the features as prescribed.
nl_features <- function(X, Y) {
cbind(1, X, Y, X*Y, X**2, Y**2)
}
w <- (Reduce(`+`,
  lapply(
    1:10000,
    function(x) {
      D <- generate_ls_dataset(1000, target = nlf_noisy)
      olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)
    }
  )
)/10000) %>% t %>% as.vector
names(w) <- sapply(0:5, function(x) sprintf('w_%d', x))
round(w, digits = 4) %>% t %>% as_tibble
And we get, by a somewhat ugly stretch, alternative \(a\). Though our weights are somewhat different, it does make sense that we assign close-to-zero weights to the \(x_1\), \(x_2\), and \(x_1 x_2\) features (\(w_1\), \(w_2\), and \(w_3\)), as those are not relevant to recovering the original surface. Indeed, we only need the intercept and the squared terms (\(w_0\), \(w_4\), and \(w_5\)) and, remarkably, our weights on these features are almost proportional to the coefficients of the original surface, \((-0.6, 0, 0, 0, 1, 1)\), scaled by \(-w_0/0.6\).
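To make the comparison concrete, here is a small extra check (not part of the exercise): rescale the averaged weights so the intercept matches \(-0.6\) and put them next to the surface’s coefficients.
local({
  scale <- w["w_0"] / -0.6   # w should be roughly scale * (-0.6, 0, 0, 0, 1, 1)
  rbind(
    rescaled = round(w / scale, 4),
    surface = c(-0.6, 0, 0, 0, 1, 1)
  )
})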
Let’s try to plot the decision surface. Since inverting (i.e., solving for \(y\)) the equation for the surface algebraically is a bit of a pain, we will settle for a numeric contour:
plot_olsr_nl <- function(D, w_final, features, x_min, x_max, step) {
y <- x <- seq(x_min, x_max, step)
contour(
x, y, outer(x, y, function(x, y) features(x, y) %*% w_final),
add = TRUE,
levels = c(0),
col = 'green',
lwd = 2
)
}
D <- generate_ls_dataset(5000, target = nlf_noisy)
w <- olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)
plot_dataset(D, surface = nlf_surface)
plot_olsr_nl(D, w, nl_features, -1, 1, 0.001)
What happens if we try to fit a dataset that is linearly separable in two dimensions?
local({
D <- generate_ls_dataset(5000)
w <- olsr(nl_features(get_x(D$X, no_bias = TRUE), get_y(D$X)), D$z)
plot_dataset(D)
plot_olsr_nl(D, w, nl_features, -1, 1, 0.001)
})
So, rather nicely, it stretches the circle enough that it follows the straight line. And, assuming that \(\mathcal{X} = \left[-1, 1\right]^2\) and that no out-of-sample points fall outside those boundaries (i.e., that our sample is unbiased), this is actually a good solution.
Just by looking at the curve fitted by linear regression in Exercise 9, I would expect us to get around \(10\%\) of the labels wrong. And this would indeed be great, as it would mean that our \(E_{out}\) is essentially equivalent to the noise.
summary(
  unlist(mclapply(
    1:100,
    function(x) {
      D_test <- generate_ls_dataset(10000, target = nlf_noisy)
      sum(D_test$z != general_f(
        nl_features(get_x(D_test$X, no_bias = TRUE), get_y(D_test$X)), w
      )) / 10000
    },
    mc.cores = 4
  ))
)
Well, the mean error is close to \(12\%\), so only around \(2\%\) above the noise level. This is nevertheless closest to alternative \(b\).