(Instructor: Nishant Panda)
Computational statistics forms a computational bridge between probability models (the math world) and statistical analysis (the “data”, or real, world). Let's look at Wikipedia for the distinction between statistical computing and computational statistics:
https://en.wikipedia.org/wiki/Computational_statistics
The primary goal of the course is to teach you the design of algorithms for implementing statistical methods on computers.
Main topics are:
You take a stick of length \(l\) and break it into 3 pieces. What is the probability that the three pieces make a triangle? (A quick simulation sketch of this appears below.)
You roll a die \(n\) times and count the number of 1s, 2s, …, 6s. Can you predict the next die outcome?
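Before any formal machinery, questions like the stick problem can be explored by simulation. Below is a minimal Monte Carlo sketch; it assumes the two break points are chosen independently and uniformly along the stick (other breaking schemes give different answers).
# Monte Carlo sketch of the broken-stick question (uniform, independent break points assumed)
set.seed(1)
n.sims <- 100000
u <- runif(n.sims)
v <- runif(n.sims)
break1 <- pmin(u, v)
break2 <- pmax(u, v)
piece1 <- break1            # lengths of the three pieces (stick of length 1)
piece2 <- break2 - break1
piece3 <- 1 - break2
# the pieces form a triangle iff each one is shorter than the sum of the other two
makes.triangle <- (piece1 < piece2 + piece3) &
  (piece2 < piece1 + piece3) &
  (piece3 < piece1 + piece2)
mean(makes.triangle)        # estimated probability; close to 1/4 under this scheme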
In this brief section we will look at all the mathematical preliminaries needed for the course.
We denote by \(\mathbb{R}^p\) the \(p\)-dimensional Euclidean space. A point \(\mathbf{x}\) in \(\mathbb{R}^p\) will have \(p\) coordinates, and we write \(\mathbf{x} = (x_1, x_2, \dots, x_p)\). A function \(f\) from \(\mathbb{R}^p\) to \(\mathbb{R}^n\) is a vector valued function that takes a point \(\mathbf{x} = (x_1, x_2, \dots, x_p)\) in \(\mathbb{R}^p\) (input space) and gives a point \(\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_n(\mathbf{x}))\) in \(\mathbb{R}^n\) (output space). An important case, at least in this course, is when \(f\) takes a vector in \(\mathbb{R}^p\) and spits out a real number. For example, \(f(x,y) = x^2 + y^2\) takes a point in \(\mathbb{R}^2\) and gives a real number.
To visualize such functions, it is important to draw contour plots. A contour is the set of points in the input space for which the function is constant. What are the contours of the function \(f(x,y) = x^2 + y^2\)?
This class is also about writing functions! Our function for plotting 2D contours should take an expression and the ranges of x and y. (What are expressions? Google it!)
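Below is a minimal sketch of what such a function could look like. The name plot_contour_2d and its arguments are illustrative, and it assumes the expression is written in terms of x and y; the chunk that follows builds up the same steps one at a time.
# a sketch of a 2D contour plotting function (name and arguments are illustrative)
plot_contour_2d <- function(expr, x.range, y.range, n = 101) {
  # grids of x and y values over the requested ranges
  x.seq <- seq(from = x.range[1], to = x.range[2], length.out = n)
  y.seq <- seq(from = y.range[1], to = y.range[2], length.out = n)
  plot.grid <- expand.grid(x = x.seq, y = y.seq)
  # evaluate the expression (assumed to be in terms of x and y) on the grid
  z.matrix <- matrix(eval(expr, envir = plot.grid), nrow = n)
  contour(x = x.seq, y = y.seq, z = z.matrix)
}
# example usage:
plot_contour_2d(expression(x^2 + y^2), x.range = c(-1, 1), y.range = c(-1, 1))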
# Let us create a sequence of x values
x.seq <- seq(from = -1.0, to = 1.0, length.out = 101)
y.seq <- seq(from = -1.0, to = 1.0, length.out = 101)
# create a grid of x,y values
plot.grid <- expand.grid(x = x.seq, y = y.seq)
# z values
z.values <- eval(expression(x^2 + y^2), envir = plot.grid)
z.matrix <- matrix(z.values, nrow = 101)
# plot the contour
contour(x = x.seq, y = y.seq, z = z.matrix)
library(plot3D)
persp3D(x.seq, y.seq, z.matrix, phi = 45, theta = 45, image = TRUE,
xlab = "x Coordinate ", ylab = "y Coordinate ",
main = "f(x,y)"
)
For real valued functions (output space is \(\mathbb{R}\)), the derivative at a point \(\mathbf{a} = (a_1, a_2, \dots, a_p)\) is given by the gradient, \[\nabla{f}(\mathbf{a}) = \left(\frac{\partial{f}(\mathbf{a})}{\partial{x_1}}, \dots, \frac{\partial{f}(\mathbf{a})}{\partial{x_p}}\right) \]
For example, the gradient of the function \(f(x_1,x_2) = x_1^2 + x_2^2\) at the point \((1,2)\) is given by \((2, 4)\). The gradient “points” in the direction in which the function increases fastest.
# Gradient in R
library(numDeriv)
# Let us create a function
my_func <- function(x){x[1]^2 + x[2]^2}
# get gradient at (1,2)
a <- c(1,2)
grad(my_func, a)
## [1] 2 4
The second derivative of a real valued function at a point \(\mathbf{a} = (a_1, a_2, \dots, a_p)\) is given by the \(p\times p\) Hessian matrix, \[ H_{i,j}(\mathbf{a}) = \frac{\partial^{2}f(\mathbf{a})}{\partial x_i\partial x_j}\] For example, the Hessian of the function \(f(x_1,x_2) = x_1^2 + x_2^2\) is given by, \[ \begin{pmatrix} 2 & 0\\ 0 & 2 \end{pmatrix} \] and so at the point \((1,2)\) (indeed at any point!) it is given by \[ \begin{pmatrix} 2 & 0\\ 0 & 2 \end{pmatrix} \]
# get Hessian
round(hessian(my_func, a), 2)
##      [,1] [,2]
## [1,]    2    0
## [2,]    0    2
We use capital letters \(X, Y, W\) etc. to denote random variables. A random vector \(\mathbf{X} = (X_1, X_2, \dots, X_p)\) will be denoted by boldface.
We write \(X \sim f(x)\) to mean that \(X\) is distributed with a density function \(f(x)\). To find the probability that \(X\) belongs to some set \(A\), we integrate the density over the set \(A\), i.e. \[ P\left(X \in A \right) = \int_{A}f(x)dx \] Typically a density is indexed by some parameters. For example, let \(W\) be the random variable that is the weight of an adult male. We believe that \(W\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), i.e. \(W \sim N(\mu, \sigma^2)\). Hence, the (univariate) normal density is indexed by two parameters.
For any parametric density we denote the parameter vector by \(\mathbf{\theta}\), which is a point in \(\mathbb{R}^p\) for some \(p\). For example, the univariate normal density has a parameter vector in \(\mathbb{R}^2\), \(\mathbf{\theta} = (\theta_1, \theta_2)\), where \(\theta_1\) is the mean (\(\mu\)) and \(\theta_2\) is the std. deviation (\(\sigma\)). In general, we say that a density \(f(x\vert \mathbf{\theta})\) is indexed by the parameter vector \(\mathbf{\theta}\).
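As a concrete illustration in R, here is a quick check that once the parameter vector of a normal density is fixed we can compute probabilities by integrating the density; the particular values of \(\mu\) and \(\sigma\) below are made up.
# normal density with parameter vector theta = (mu, sigma); the values are illustrative
theta <- c(70, 10)
f <- function(w) dnorm(w, mean = theta[1], sd = theta[2])
# P(W in A) for the set A = [60, 80], i.e. within one std. deviation of the mean
integrate(f, lower = 60, upper = 80)   # roughly 0.683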
Statistical inference using likelihood tries to figure out the parameter vector of our random variable using data. If \(X_1, X_2, X_3, \dots, X_n\) are i.i.d (a random sample of size \(n\) from \(X\)), each having density \(f(x\vert\mathbf{\theta})\) with the unknown parameter vector \(\mathbf{\theta} = (\theta_1, \theta_2, \dots, \theta_p)\), then the (joint) likelihood function is, \[ L(\mathbf{\theta}) = \prod\limits_{i=1}^{n}f(x_i\vert\mathbf{\theta}), \] where \(x_1, x_2, \dots, x_n\) is the observed data. Note that the likelihood function is a real valued function on the input space \(\mathbb{R}^p\). (Strictly speaking the input space is a subset of \(\mathbb{R}^p\).) Hence, we can talk about its gradient \(\nabla L(\mathbf{\theta})\), Hessian, etc.!
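In R, this definition can be written almost verbatim for a normal sample; the simulated data below are purely illustrative.
# likelihood of an i.i.d. normal sample as a function of theta = (mu, sigma)
likelihood <- function(theta, x) {
  prod(dnorm(x, mean = theta[1], sd = theta[2]))
}
# illustrative data: a small simulated sample
set.seed(42)
x.obs <- rnorm(10, mean = 5, sd = 2)
likelihood(c(5, 2), x.obs)   # likelihood at one candidate parameter vector
likelihood(c(0, 2), x.obs)   # a poorly matching parameter vector gives a much smaller value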
What happens if the data is not i.i.d? Then \(L(\mathbf{\theta})\) is given by the joint density \(f(x_1, x_2, \dots, x_n \vert \mathbf{\theta})\) viewed as a function of \(\theta\).
Given data, we seek to find the parameter \(\theta\) that best explains the data. This is a loaded sentence but intuitively it means the following:
The observed data \(x_1, x_2, \dots, x_n\) might have been realized under many different values of \(\mathbf{\theta}\). The parameter vector for which observing \(x_1, x_2, \dots, x_n\) would be most likely constitutes the maximum likelihood estimate of \(\mathbf{\theta}\).
Thus, the \(\mathbf{\theta}\), let's call it \(\theta_{ML}\), that maximizes \(L(\mathbf{\theta})\) will be the maximum likelihood estimate. Maximizing the likelihood involves finding a critical point (where the derivative is 0). It is tricky to differentiate products, hence we work with the log likelihood function, \[ l(\mathbf{\theta}) = log\left(L(\mathbf{\theta})\right) \] (log is monotonically increasing!). To find \(\theta_{ML}\), we need to set \[ \nabla l(\mathbf{\theta}) = \mathbf{0} \] and solve a system of \(p\) equations. The gradient of the log likelihood is called the score function.
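Continuing the illustrative normal sample from above, the log likelihood, the score (its gradient, computed here numerically with numDeriv), and a numerical maximizer can be sketched as follows; this is one possible approach, not the only one.
# log likelihood for the same illustrative sample x.obs
log.likelihood <- function(theta, x) {
  sum(dnorm(x, mean = theta[1], sd = theta[2], log = TRUE))
}
# the score function is the gradient of the log likelihood (numDeriv was loaded earlier)
library(numDeriv)
grad(function(theta) log.likelihood(theta, x.obs), c(5, 2))
# numerical maximum likelihood estimate: optim() minimizes, so flip the sign
neg.log.likelihood <- function(theta) {
  if (theta[2] <= 0) return(1e10)   # penalize non-positive sigma during the search
  -log.likelihood(theta, x.obs)
}
optim(par = c(0, 1), fn = neg.log.likelihood)$par   # estimates of (mu, sigma)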
Let us look at an example. Suppose we know that weight \(W \sim N(\mu, \sigma^2)\), and we know the variance and hence the std. deviation \(\sigma\). Thus, our parameter vector is just a real number, \(\theta = \mu\). Suppose we observe \(4\) i.i.d weights, \(w_1, w_2, w_3\) and \(w_4\). What is \(L(\mathbf{\theta})\)? \[ L(\mu) = f(w_1\vert \mu) \times f(w_2\vert \mu) \times f(w_3\vert\mu) \times f(w_4\vert \mu), \] where, \[ f(w\lvert\mu) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(w-\mu)^2}{2\sigma^2}} \]
Hence, \[ log(f(w\lvert\mu)) = log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) + log\left(e^{-\frac{(w-\mu)^2}{2\sigma^2}}\right) \]
Let us denote \(A = log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)\), then, \[ log(f(w\lvert\mu)) = A - \frac{(w-\mu)^2}{2\sigma^2} \]
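A quick numerical sanity check of this algebra (the values of \(w\), \(\mu\), and \(\sigma\) below are arbitrary): the hand-derived expression should agree with R's built-in log density.
# check that log(f(w | mu)) = A - (w - mu)^2 / (2 * sigma^2)
w <- 82; mu <- 75; sigma <- 10                 # arbitrary illustrative values
A <- log(1 / (sqrt(2 * pi) * sigma))
A - (w - mu)^2 / (2 * sigma^2)                 # hand-derived log density
dnorm(w, mean = mu, sd = sigma, log = TRUE)    # built-in log density gives the same value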