Overview of Statistical Learning

library(ggplot2)
set.seed(481)

The Question

We assume there is a function \( f: X \rightarrow Y \) that gives us the information we want or need. Assuming that such a function exists amounts to assuming that there is a pattern in the data we are looking at.

If we already have this function, and it is reasonably computable, there is nothing more to do, so we assume the function is (for practical purposes) unknown. The data we can obtain about this function are also likely to be noisy. In principle this noise can vary with the input, but we often assume it has mean \( 0 \) and a distribution that does not otherwise depend on the input.
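
One common way to write this assumption is sketched below. The symbol \( \varepsilon \) for the noise term is notation introduced here for illustration and does not appear elsewhere in this text.

\[ y = f(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0 \]

If the noise depended on the input, \( \varepsilon \) would be replaced by a term \( \varepsilon(x) \) whose distribution varies with \( x \).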

We want to either get information on \( f(x) \) for an arbitrary \( x \) (prediction), or get information on the behavior of the function \( f \) itself (inference).
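
As a rough illustration of the difference, the sketch below simulates noisy observations of a known function and fits a linear model to them. Everything in it is an assumption made for illustration: the choice \( f(x) = x^2 \), the noise level, the model family y ~ I(x^2), and the names train, fit, prediction, estimates and intervals. It is one possible way to show the two goals side by side, not the method this text prescribes.

```r
# Simulated noisy observations of f(x) = x^2 (assumed setup, for illustration)
x <- runif(n = 200, min = 2, max = 6)
y <- x^2 + rnorm(n = 200, mean = 0, sd = 3)
train <- data.frame(x = x, y = y)

# Assumed model family: y = a + b * x^2 + noise
fit <- lm(y ~ I(x^2), data = train)

# Prediction: estimate the value of f at a new input, here x = 5
prediction <- predict(fit, newdata = data.frame(x = 5))

# Inference: look at the estimated coefficients and their uncertainty
estimates <- coef(fit)
intervals <- confint(fit)
```

Since no seed is set inside the sketch, the numbers differ slightly between runs. The prediction should land near \( 25 \) and the coefficient of the squared term near \( 1 \).
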

Note that not all elements of \( X \) need to be equally likely. The data may be generated non-uniformly, and we want to take this into account, so we assume there is a probability distribution on \( X \).

The Data

Assume we have

\[ (x_0, y_0), \ldots, (x_n, y_n), \text{ with } f(x_i) = y_i \]

There are different questions you can ask about the data: how reliable is it, and is there enough of it? Here we mainly treat the data as the input to the learning algorithm. When the observations are noisy, the equality \( f(x_i) = y_i \) holds only up to the noise. Note also that, because of the probability distribution on \( X \), not all \( x_i \) are equally likely to occur.

Examples

If \( X \) has a uniform distribution on \( [2,6] \), and \( y \) is determined by \( x \) without noise as \( y = f(x) = x^2 \), this situation can be coded in R as follows:

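# Draw a single input from the distribution on X
input.get <- function() {
    runif(n = 1, min = 2, max = 6)
}
# The (normally unknown) target function
f <- function(x) {
    x^2
}
# Produce one observation (x, y)
datum.get <- function() {
    x <- input.get()
    y <- f(x)
    c(x, y)
}
# Collect 100 observations, one row per observation
data <- data.frame(t(sapply(seq(100), function(x) {
    datum.get()
})))
names(data) <- c("x", "y")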
ggplot(data, aes(x = x, y = y)) + geom_point() + geom_rug(sides = "b")
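
The same data set can also be built without calling datum.get() once per point. The following is a minimal sketch of that alternative, assuming the same uniform input distribution and the same \( f \). The names n and data.vec are hypothetical and chosen here so that the data frame above is not overwritten.

```r
# Vectorized sketch of the uniform example: draw all inputs at once
n <- 100                              # same sample size as above
x <- runif(n = n, min = 2, max = 6)   # inputs, uniform on [2, 6]
y <- x^2                              # noise-free response f(x) = x^2
data.vec <- data.frame(x = x, y = y)  # data.vec is a hypothetical name

# Quick checks on the result
dim(data.vec)       # 100 rows and 2 columns
range(data.vec$x)   # stays inside [2, 6]
```

This draws different random numbers than the element-by-element version, so the individual points differ even though the distribution is the same.
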

If \( X \) is instead normal with mean \( 4 \) and standard deviation \( 2 \) (that is, variance \( 4 \)), and we draw \( 200 \) points, we get

input.get <- function() {
    rnorm(n = 1, mean = 4, sd = 2)
}
data <- data.frame(t(sapply(seq(200), function(x) {
    datum.get()
})))
names(data) <- c("x", "y")
ggplot(data, aes(x = x, y = y)) + geom_point() + geom_rug(sides = "b")
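
Note that the sd argument of rnorm() is a standard deviation, not a variance. The short sketch below, with the hypothetical names x.sd2 and x.var2 and an arbitrarily chosen sample size, compares the two parameterizations empirically.

```r
# rnorm() is parameterized by the standard deviation
x.sd2  <- rnorm(n = 1e5, mean = 4, sd = 2)        # standard deviation 2
x.var2 <- rnorm(n = 1e5, mean = 4, sd = sqrt(2))  # variance 2

var(x.sd2)    # close to 4
var(x.var2)   # close to 2
```
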

If, in addition, the observations of \( f \) are noisy, let's say with noise of mean \( 0 \) and standard deviation \( 3 \), we get

f <- function(x) {
    # one independent noise draw per input, so f also works on vectors
    x^2 + rnorm(n = length(x), mean = 0, sd = 3)
}
data <- data.frame(t(sapply(seq(200), function(x) {
    datum.get()
})))
names(data) <- c("x", "y")
ggplot(data, aes(x = x, y = y)) + geom_point() + geom_rug(sides = "b")
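
Because the simulation knows the noise-free function, the noise can be recovered and checked against the assumed mean and standard deviation. The following self-contained sketch does this under the same assumptions as the example above. The names n, noise and residual are introduced here for illustration.

```r
# Regenerate noisy data with the same distributions as the example above
n <- 200
x <- rnorm(n = n, mean = 4, sd = 2)       # inputs
noise <- rnorm(n = n, mean = 0, sd = 3)   # additive noise
y <- x^2 + noise                          # noisy observations

# Subtracting the known noise-free value recovers the noise
residual <- y - x^2
mean(residual)   # should be near 0
sd(residual)     # should be near 3
```

With real data \( f \) is unknown, so this subtraction is not available and the noise can only be estimated together with \( f \).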