(Instructor: Nishant Panda)
Computational statistics forms a computational bridge between probability models (the math world) and statistical analysis (the “data”, or real, world). Let's look at Wikipedia for the distinction between statistical computing and computational statistics:
https://en.wikipedia.org/wiki/Computational_statistics
The primary goal of the course is to teach you the design of algorithms for implementing statistical methods on computers.
Main topics are:
You take a stick of length \(l\) and break it into 3 pieces. What is the probability that the three pieces make a triangle? (A quick simulation sketch of this appears below.)
You roll a die \(n\) times and count the number of 1s, 2s, …, 6s. Can you predict the next die outcome?
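Before any formal machinery, questions like the stick problem can be explored by simulation. Below is a minimal Monte Carlo sketch; it assumes the two break points are chosen independently and uniformly along the stick (other breaking schemes give different answers).
# Monte Carlo sketch of the broken-stick question (uniform, independent break points assumed)
set.seed(1)
n.sims <- 100000
u <- runif(n.sims)
v <- runif(n.sims)
break1 <- pmin(u, v)
break2 <- pmax(u, v)
piece1 <- break1            # lengths of the three pieces (stick of length 1)
piece2 <- break2 - break1
piece3 <- 1 - break2
# the pieces form a triangle iff each one is shorter than the sum of the other two
makes.triangle <- (piece1 < piece2 + piece3) &
  (piece2 < piece1 + piece3) &
  (piece3 < piece1 + piece2)
mean(makes.triangle)        # estimated probability; close to 1/4 under this scheme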
In this brief section we will look at all the mathematical preliminaries needed for the course.
We denote by \(\mathbb{R}^p\) the \(p\)-dimensional Euclidean space. A point \(\mathbf{x}\) in \(\mathbb{R}^p\) will have \(p\) coordinates, and we write \(\mathbf{x} = (x_1, x_2, \dots, x_p)\). A function \(f\) from \(\mathbb{R}^p\) to \(\mathbb{R}^n\) is a vector valued function that takes a point \(\mathbf{x} = (x_1, x_2, \dots, x_p)\) in \(\mathbb{R}^p\) (input space) and gives a point \(\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_n(\mathbf{x}))\) in \(\mathbb{R}^n\) (output space). An important case, at least in this course, is when \(f\) takes a vector in \(\mathbb{R}^p\) and spits out a real number. For example, \(f(x,y) = x^2 + y^2\) takes a point in \(\mathbb{R}^2\) and gives a real number.
To visualize such functions, it is important to draw contour plots. A contour is the set of points in the input space for which the function is constant. What are the contours of the function \(f(x,y) = x^2 + y^2\)?
This class is also about writing functions! Our function for plotting 2D contours should take an expression and the ranges of x and y. (What are expressions? Google it!)
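Below is a minimal sketch of what such a function could look like. The name plot_contour_2d and its arguments are illustrative, and it assumes the expression is written in terms of x and y; the chunk that follows builds up the same steps one at a time.
# a sketch of a 2D contour plotting function (name and arguments are illustrative)
plot_contour_2d <- function(expr, x.range, y.range, n = 101) {
  # grids of x and y values over the requested ranges
  x.seq <- seq(from = x.range[1], to = x.range[2], length.out = n)
  y.seq <- seq(from = y.range[1], to = y.range[2], length.out = n)
  plot.grid <- expand.grid(x = x.seq, y = y.seq)
  # evaluate the expression (assumed to be in terms of x and y) on the grid
  z.matrix <- matrix(eval(expr, envir = plot.grid), nrow = n)
  contour(x = x.seq, y = y.seq, z = z.matrix)
}
# example usage:
plot_contour_2d(expression(x^2 + y^2), x.range = c(-1, 1), y.range = c(-1, 1))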
# Let us create a sequence of x values
x.seq <- seq(from = -1.0, to = 1.0, length.out = 101)
y.seq <- seq(from = -1.0, to = 1.0, length.out = 101)
# create a grid of x,y values
plot.grid <- expand.grid(x = x.seq, y = y.seq)
# z values
z.values <- eval(expression(x^2 + y^2), envir = plot.grid)
z.matrix <- matrix(z.values, nrow = 101)
# plot the contour
contour(x = x.seq, y = y.seq, z = z.matrix)
library(plot3D)
persp3D(x.seq, y.seq, z.matrix, phi = 45, theta = 45, image = TRUE,
xlab = "x Coordinate ", ylab = "y Coordinate ",
main = "f(x,y)"
)
For real valued functions (output space is \(\mathbb{R}\)), the derivative at a point \(\mathbf{a} = (a_1, a_2, \dots, a_p)\) is given by the gradient, \[\nabla{f}(\mathbf{a}) = \left(\frac{\partial{f}(\mathbf{a})}{\partial{x_1}}, \dots, \frac{\partial{f}(\mathbf{a})}{\partial{x_p}}\right) \]
For example, the gradient of the function \(f(x_1,x_2) = x_1^2 + x_2^2\) at the point \((1,2)\) is given by \((2, 4)\). The gradient “points” in the direction in which the function increases fastest.
# Gradient in R
library(numDeriv)
# Let us create a function
my_func <- function(x){x[1]^2 + x[2]^2}
# get gradient at (1,2)
a <- c(1,2)
grad(my_func, a)
## [1] 2 4
The second derivative of a real valued function at a point \(\mathbf{a} = (a_1, a_2, \dots, a_p)\) is given by the \(p\times p\) Hessian matrix, \[ H_{i,j}(\mathbf{a}) = \frac{\partial^{2}f(\mathbf{a})}{\partial x_i\partial x_j}\] For example, the Hessian of the function \(f(x_1,x_2) = x_1^2 + x_2^2\) is given by, \[ \begin{pmatrix} 2 & 0\\ 0 & 2 \end{pmatrix} \] and so at the point \((1,2)\) (indeed at any point!) it is given by \[ \begin{pmatrix} 2 & 0\\ 0 & 2 \end{pmatrix} \]
# get Hessian
round(hessian(my_func, a), 2)
##      [,1] [,2]
## [1,]    2    0
## [2,]    0    2
We use capital letters \(X, Y, W\) etc. to denote random variables. A random vector \(\mathbf{X} = (X_1, X_2, \dots, X_p)\) will be denoted by boldface.
We write \(X \sim f(x)\) to mean that \(X\) is distributed with a density function \(f(x)\). To find the probability that \(X\) belongs to some set \(A\), we integrate the density over the set \(A\), i.e. \[ P\left(X \in A \right) = \int_{A}f(x)dx \] Typically a density is indexed by some parameters. For example, let \(W\) be the random variable that is the weight of an adult male. We believe that \(W\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), i.e. \(W \sim N(\mu, \sigma^2)\). Hence, the (univariate) normal density is indexed by two parameters.
For any parametric density we denote the parameter vector by \(\mathbf{\theta}\), which is a point in \(\mathbb{R}^p\) for some \(p\). For example, the univariate normal density has a parameter vector in \(\mathbb{R}^2\), \(\mathbf{\theta} = (\theta_1, \theta_2)\), where \(\theta_1\) is the mean (\(\mu\)) and \(\theta_2\) is the std. deviation (\(\sigma\)). In general, we say that a density \(f(x\vert \mathbf{\theta})\) is indexed by the parameter vector \(\mathbf{\theta}\).
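As a concrete illustration in R, here is a quick check that once the parameter vector of a normal density is fixed we can compute probabilities by integrating the density; the particular values of \(\mu\) and \(\sigma\) below are made up.
# normal density with parameter vector theta = (mu, sigma); the values are illustrative
theta <- c(70, 10)
f <- function(w) dnorm(w, mean = theta[1], sd = theta[2])
# P(W in A) for the set A = [60, 80], i.e. within one std. deviation of the mean
integrate(f, lower = 60, upper = 80)   # roughly 0.683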
Statistical inference using likelihood tries to figure out the parameter vector of our random variable using data. If \(X_1, X_2, X_3, \dots, X_n\) are i.i.d (a random sample of size \(n\) from \(X\)), each having density \(f(x\vert\mathbf{\theta})\) with the unknown parameter vector \(\mathbf{\theta} = (\theta_1, \theta_2, \dots, \theta_p)\), then the (joint) likelihood function is, \[ L(\mathbf{\theta}) = \prod\limits_{i=1}^{n}f(x_i\vert\mathbf{\theta}), \] where \(x_1, x_2, \dots, x_n\) is the observed data. Note that the likelihood function is a real valued function on the input space \(\mathbb{R}^p\). (Strictly speaking the input space is a subset of \(\mathbb{R}^p\).) Hence, we can talk about its gradient \(\nabla L(\mathbf{\theta})\), Hessian, etc.!
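In R, this definition can be written almost verbatim for a normal sample; the simulated data below are purely illustrative.
# likelihood of an i.i.d. normal sample as a function of theta = (mu, sigma)
likelihood <- function(theta, x) {
  prod(dnorm(x, mean = theta[1], sd = theta[2]))
}
# illustrative data: a small simulated sample
set.seed(42)
x.obs <- rnorm(10, mean = 5, sd = 2)
likelihood(c(5, 2), x.obs)   # likelihood at one candidate parameter vector
likelihood(c(0, 2), x.obs)   # a poorly matching parameter vector gives a much smaller value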
What happens if the data is not i.i.d? Then \(L(\mathbf{\theta})\) is given by the joint density \(f(x_1, x_2, \dots, x_n \vert \mathbf{\theta})\) viewed as a function of \(\theta\).
Given data, we seek to find the parameter \(\theta\) that best explains the data. This is a loaded sentence but intuitively it means the following:
The observed data \(x_1, x_2, \dots, x_n\) might have been realized under many different values of \(\mathbf{\theta}\). The parameter vector for which observing \(x_1, x_2, \dots, x_n\) would be most likely constitutes the maximum likelihood estimate of \(\mathbf{\theta}\).
Thus, the \(\mathbf{\theta}\), let's call it \(\theta_{ML}\), that maximizes \(L(\mathbf{\theta})\) will be the maximum likelihood estimate. Maximizing the likelihood involves finding a critical point (where the derivative is 0). It is tricky to differentiate products, hence we work with the log likelihood function, \[ l(\mathbf{\theta}) = log\left(L(\mathbf{\theta})\right) \] (log is monotonically increasing!). To find \(\theta_{ML}\), we need to set \[ \nabla l(\mathbf{\theta}) = \mathbf{0} \] and solve a system of \(p\) equations. The gradient of the log likelihood is called the score function.
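Continuing the illustrative normal sample from above, the log likelihood, the score (its gradient, computed here numerically with numDeriv), and a numerical maximizer can be sketched as follows; this is one possible approach, not the only one.
# log likelihood for the same illustrative sample x.obs
log.likelihood <- function(theta, x) {
  sum(dnorm(x, mean = theta[1], sd = theta[2], log = TRUE))
}
# the score function is the gradient of the log likelihood (numDeriv was loaded earlier)
library(numDeriv)
grad(function(theta) log.likelihood(theta, x.obs), c(5, 2))
# numerical maximum likelihood estimate: optim() minimizes, so flip the sign
neg.log.likelihood <- function(theta) {
  if (theta[2] <= 0) return(1e10)   # penalize non-positive sigma during the search
  -log.likelihood(theta, x.obs)
}
optim(par = c(0, 1), fn = neg.log.likelihood)$par   # estimates of (mu, sigma)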
Let us look at an example. Suppose we know that weight \(W \sim N(\mu, \sigma^2)\), and we know the variance and hence the std. deviation \(\sigma\). Thus, our parameter vector is just a real number, \(\theta = \mu\). Suppose we observe \(4\) i.i.d weights, \(w_1, w_2, w_3\) and \(w_4\). What is \(L(\mathbf{\theta})\)? \[ L(\mu) = f(w_1\vert \mu) \times f(w_2\vert \mu) \times f(w_3\vert\mu) \times f(w_4\vert \mu), \] where, \[ f(w\lvert\mu) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(w-\mu)^2}{2\sigma^2}} \]
Hence, \[ log(f(w\lvert\mu)) = log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) + log\left(e^{-\frac{(w-\mu)^2}{2\sigma^2}}\right) \]
Let us denote \(A = log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)\), then, \[ log(f(w\lvert\mu)) = A - \frac{(w-\mu)^2}{2\sigma^2} \]
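A quick numerical sanity check of this algebra (the values of \(w\), \(\mu\), and \(\sigma\) below are arbitrary): the hand-derived expression should agree with R's built-in log density.
# check that log(f(w | mu)) = A - (w - mu)^2 / (2 * sigma^2)
w <- 82; mu <- 75; sigma <- 10                 # arbitrary illustrative values
A <- log(1 / (sqrt(2 * pi) * sigma))
A - (w - mu)^2 / (2 * sigma^2)                 # hand-derived log density
dnorm(w, mean = mu, sd = sigma, log = TRUE)    # built-in log density gives the same value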