05/02/2018

Course Outline

  • Model \(N\), the number of complex events: How many individuals/products/firms are in a certain location? How do they evolve? How do they interact?

  • Data: non-linearity, non-normality, multivariate structure, dynamics.

  • Generalized Linear Models (GLM): Fixed, random and mixed effects. Mixture models. Hierarchical models. Point process and random fields.

  • Applications: Temporal dynamics (population growth) and network/spatial patterns (gravity model, epidemic model, congestion). (Note: many structural econometric models are in fact GLMs.)

Course Information

  • Slides, code, and supplements will be uploaded to Moodle (https://moodleucl.uclouvain.be/) under the name LECON2603.

  • Two presentations and one short report on your presentations. (You also need to comment on the other presentations.)

  • Office hour: Friday 3pm-5pm, or by appointment (email: zhengyuan.gao@uclouvain.be)

  • Place: Voie du Roman Pays 34, Room C215

References

No single reference book covers all the topics, but you may find some of these books useful.

  • Extending the Linear Model with R by Faraway.
  • Multivariate Statistical Modelling Based on Generalized Linear Models by Fahrmeir and Tutz.
  • Generalized Linear Models by McCullagh and Nelder.
  • Statistical Analysis of Network Data with R by Kolaczyk and Csardi.
  • Spatial and Spatio-temporal Bayesian Models with R by Blangiardo and Cameletti.
  • Statistics for Spatio-Temporal Data by Cressie and Wikle.

What's happening in high dimensions?

  • A \(100\times100\) grid.

  • Each node in the grid has two states \(\{\mbox{Yes, No}\}\).

  • We model the state by a discrete random variable.

  • We need \(100^2=10000\) random variables for all nodes.

  • This is simple.

What's happening in high dimensions?

  • If all these nodes influence their neighbours, the problem becomes difficult.

  • We need a network representation for \(100^2\) random variables.

  • Solution: imposing a structure, e.g. dependence only between neighbouring nodes, can reduce the difficulty.

Parametric Statistical Model

  • Complex data: how to represent the data, how to use the models, and how to learn them?

  • Suitable assumptions: design models with the right level of complexity for probabilistic interpretations and tractable computations.

  • A parametric statistical model \(\mathcal{P}_{\Theta}\) is a collection of probability distributions defined on the same space and parameterized by a parameter \(\theta\) belonging to a set \(\Theta\subseteq\mathbb{R}^{d}\), \[\mathcal{P}_{\Theta}=\{p_{\theta}(\cdot)\mid\theta\in\Theta\}.\]

Poisson Distribution

  • The Poisson distribution is a simple model for discrete, non-negative count data \[p_\theta (x) = \frac{e^{-\theta} \theta^x}{x!}.\]

  • \(\theta\): the only parameter, which is also the mean of the distribution. It models the intensity rate or the expected count \(\mbox{Var}[X]=\mathbb{E}[X]=\theta\).

  • Simulation: \(20\) observations with \(\theta = 1\).

rpois(n=20, 1)
##  [1] 2 1 1 1 1 2 0 0 0 0 2 0 0 0 1 0 3 0 1 0

Poisson Distribution

[Figure: Poisson distributions for \(\theta = 1, 5, 20\).]
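
The original figure is not reproduced here. A minimal sketch of how such a comparison could be drawn (the plotting details are assumed):

x = 0:40
plot(x, dpois(x, 1), type="h", xlab="x", ylab="Probability")
points(x + 0.2, dpois(x, 5), type="h", col="red")
points(x + 0.4, dpois(x, 20), type="h", col="blue")
legend("topright", legend=c("theta = 1", "theta = 5", "theta = 20"),
       col=c("black", "red", "blue"), lty=1)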

Poisson Likelihood

likelihood.function = function(x, theta)
{
  logL = dpois(x, theta, log=TRUE) # the log likelihood
  sumlogL = sum(logL)
  return(sumlogL)
}
x = rpois(n=1000, 8)
likelihood.function(x, 10)
## [1] -2656.922
likelihood.function(x, 8)
## [1] -2435.376

MLE

thetas = seq(0, 30, by=0.1)
loglike.vector = numeric(0)
for(i in 1:length(thetas))
{loglike.vector[i] = likelihood.function(x, thetas[i])}
theta.hat = thetas[which.max(loglike.vector)] 
theta.hat
## [1] 8

Likelihood Plot

plot(thetas, loglike.vector, ylab="Log Likelihood")
abline(v=theta.hat)

Bernoulli Distribution

  • Consider a binary random variable \(X\) that can take the value \(0\) or \(1\).

  • If \(\Pr(X=1)\) is parametrized by \(\theta\in[0,1]\): \[\left\{ \begin{array}{ll} \Pr(X=1)=\theta\\ \Pr(X=0)=1-\theta \end{array}\right.\] then a probability distribution of the Bernoulli model: \[\Pr(X=x|\theta)=p_{\theta}(x)=\theta^{x}(1-\theta)^{1-x}\]

  • We can write \(X\sim\mbox{Ber}(\theta)\). The Bernoulli model is the collection of these distributions for \(\theta\in\Theta=[0,1]\).

Binomial Distribution

  • The binomial distribution is a canonical description of the observation process.

  • An event can either happen with a certain probability (\(\pi\)) or not (with probability \(1-\pi\)), and we observe \(n\) trials.

  • A binomial random variable \(\text{Bin}(n,\pi)\) is defined as the value of the sum of \(n\) i.i.d. Bernoulli random variables with parameter \(\pi\).

  • Parametrization: \(\theta = (n,\pi)\) two parameters.

Binomial Distribution

  • Distribution of a binomial random variable \(N\) is \[\Pr(N=k|\theta)=\binom{n}{k}\pi^{k}(1-\pi)^{n-k}.\]

  • The Poisson distribution is a limiting case of the binomial: if the number of trials \(n\) is large and the probability of success \(\pi\) is small, the binomial converges to the Poisson with an expected rate of events \(\theta_{Poi} = n \pi\). (A quick numerical check follows this list.)

  • Mean \(\mathbb{E}[N]= n \pi\) and variance \(\mbox{Var}[N]=n\pi(1-\pi)\)
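
A quick numerical check of this binomial-to-Poisson limit; the values \(n=1000\) and \(\pi=0.005\) below are chosen only for illustration:

n = 1000; pi.small = 0.005          # large n, small pi, n * pi = 5
rbind(dbinom(0:3, n, pi.small),     # binomial probabilities of 0, 1, 2, 3 events
      dpois(0:3, n * pi.small))     # Poisson probabilities with theta = n * pi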

Binomial Distribution

  • Let us assume that each event is independently observed or heard with a constant detection probability of \(0.1\)

  • Draw \(1\) and then \(10\) binomial random numbers with \(n=10\) trials and success probability \(\pi=0.1\).

pi.parameter = 0.1
rbinom(1, 10, pi.parameter)
## [1] 2
rbinom(10, 10, pi.parameter)
##  [1] 1 4 0 0 2 1 0 0 0 1

Binomial Distribution

hist(rbinom(10^6, 10, 0.4), breaks=50, col="gray")

Likelihood

  • \(\mbox{Bin}(1,0.4)=\mbox{Ber}(0.4)\). \[P(X=x|\pi)=\pi^{x}(1-\pi)^{1-x}.\]

  • Observe \(10^2\) times. The likelihood: \[\mbox{Likelihood}(\mathbf{x},0.4)= \prod_{i=1}^{100} P(X=x_i \mid 0.4) = \prod_{i=1}^{100} 0.4^{x_i}(1-0.4)^{1-x_i}.\]

pi.parameter = 0.4; 
sequence = rbinom(10^2, 1, pi.parameter)

Likelihood

likelihood = function(observation, pi.parameter)
{
  # naive likelihood: multiply the individual Bernoulli probabilities
  likelihood = 1
  for (i in 1:length(observation))
  {
    if (observation[i] == 1)
    {likelihood = likelihood * pi.parameter}
    else
    {likelihood = likelihood * (1 - pi.parameter)}
  }
  return(likelihood)
}

Likelihood Plot

possible.pi = seq(0, 1, by = 0.001)
plot(possible.pi, sapply(possible.pi, 
          function (pi) {likelihood(sequence, pi)}), ylab ="")

MLE

mle.results = optimize(function(pi) {likelihood(sequence, pi)},
                       interval = c(0, 1), maximum = TRUE)
mle.results
## $maximum
## [1] 0.449986
## 
## $objective
## [1] 1.3017e-30

  • Try sequence = rbinom(1500, 1, pi.parameter). Floating point: the likelihood of the data becomes numerically indistinguishable from \(0\). Multiplying thousands of probabilities together only makes things worse.

Log-likelihood
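
The original code for this slide is not shown here. A minimal sketch of a log-likelihood version that avoids the underflow, using dbinom(..., log=TRUE):

log.likelihood = function(observation, pi.parameter)
{
  # sum the log-probabilities instead of multiplying the probabilities
  sum(dbinom(observation, size=1, prob=pi.parameter, log=TRUE))
}
sequence = rbinom(1500, 1, 0.4)
optimize(function(pi) {log.likelihood(sequence, pi)},
         interval = c(0, 1), maximum = TRUE)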

Numerical Trap

Multinomial distribution

  • A discrete random variable that can take one of \(K\) possible values \(\{1,2,\ldots,K\}\)

  • It can be represented by a \(K\)-dimensional binary random variable \[X=(X_{1},X_{2},\ldots,X_{K})^{T}\] where \[\{k\mbox{ value is chosen}\}=\{X_k=1 \text{ and }X_{l}=0,\forall l\neq k\}\]

Multinomial distribution

  • Parametrize \(\Pr\{k\mbox{ value is chosen}\}\) by \(\pi_{k}\in[0,1]\) so \[\Pr(X_{k}=1)=\pi_{k}\ \ \forall k=1,2,\ldots,K,\,\mbox{and }\sum_{k=1}^{K}\pi_{k}=1.\]

  • The probability distribution over \(\mathbf{x}=(x_{1},\ldots,x_{K})\): \[p(\mathbf{x}|\pi,\pi^{T}\mathbf{1}=1)=p_{\theta}(\mathbf{x})=\prod_{k=1}^{K}\pi_{k}^{x_{k}}\] where \(\theta\in \Theta=\left\{ \pi\mid\sum_{k=1}^{K}\pi_{k}=1,\pi_{k}\in[0,1]\mbox{ for all }k\right\}\) and \(\pi=(\pi_{1},\ldots,\pi_{K})^{T}\). (An R illustration follows.)
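
For illustration, multinomial draws in R; the probability vector below is assumed:

pi.vector = c(0.2, 0.5, 0.3)             # K = 3 categories
rmultinom(1, size=1, prob=pi.vector)     # one draw: a one-hot vector x
rmultinom(1, size=100, prob=pi.vector)   # counts over 100 independent trials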

Gaussian Distribution

  • A scalar variable \(X\) is Gaussian distributed: \(X\sim \mathcal{N}(x|\mu,\sigma^{2})\)

  • \(\mu\) is the mean and \(\sigma^{2}\) is the variance. The density function: \[(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)\]

  • For a \(d\)-dimensional vector \(\mathbf{x}\), \[\mathcal{N}(\mathbf{x}|\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right)\] where \(\mu\) is a \(d\)-dimensional vector, \(\Sigma\) is a \(d\times d\) symmetric positive definite matrix, and \(|\Sigma|\) denotes the determinant of \(\Sigma\).

Gaussian Distribution

library(mvtnorm)
vector.mean = c(0,0); cov.matrix = matrix(c(1,0.8,0.8,1), 2,2) 
cov.matrix
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0

Gaussian Distribution

y = rmvnorm(n=30, mean=vector.mean, sigma=cov.matrix); plot(y)

Linear Regression from Probabilistic Perspective

  • A causal relation as a network with two nodes: one node corresponds to an input \(X\), and one node corresponds to an output \(Y\).

  • Model: joint distribution \(p(X,Y)\)

  • Conditional model: models the conditional probability of the output, given the input \[p(Y|X)\]

  • The linear regression models are all conditional models.

Linear Regression

  • \(Y\in\mathbb{R}\) depends linearly on \(X\in\mathbb{R}^{d}\).

  • \(\theta = (\beta,\sigma)\): \(\beta\in\mathbb{R}^{d}\) be a weighting vector, \(\sigma^{2}>0\).

  • Parametric assumption: \[Y\mid X\sim\mathcal{N}(\beta^{T}X,\sigma^{2}),\] which is equivalent to \[Y=\beta^{T}X+\varepsilon,\mbox{ with }\varepsilon\sim\mathcal{N}(0,\sigma^{2}).\]

  • Data set: \(\mathcal{D}=\lbrace(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{n},y_{n})\rbrace\) of i.i.d. random variables.

Linear Regression

  • Likelihood: \[p(y_{1},\cdots,y_{n}|\mathbf{x}_{1},\cdots,\mathbf{x}_{n},\beta,\sigma^{2})=\prod_{i=1}^{n}p(y_{i}|\mathbf{x}_{i},\beta,\sigma^{2}).\]

  • Log-likelihood:\[ -\ell(\beta,\sigma^{2})=-\sum_{i=1}^{n}\log p(y_{i}|\mathbf{x}_{i})=\frac{n}{2}\log(2\pi\sigma^{2})+\frac{1}{2}\sum_{i=1}^{n}\frac{(y_{i}-\beta^{T}\mathbf{x}_{i})^{2}}{\sigma^{2}}.\]

  • A compact expression: maximizing the log-likelihood over \(\beta\) is equivalent to \[\hat{\beta}=\arg\min_{\beta}\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\beta\|^{2},\mbox{ where }\mathbf{X}=\begin{pmatrix}\mathbf{x}_{1}^{T}\\ \vdots\\ \mathbf{x}_{n}^{T} \end{pmatrix}\in\mathbb{R}^{n\times d}.\]

Linear Regression

  • The squared norm \(\|\mathbf{y}-\mathbf{X}\beta\|^{2}\) equals \[\mathbf{y}^{T}\mathbf{y}-2\beta^{T}\mathbf{X}^{T}\mathbf{y}+\beta^{T}\mathbf{X}^{T}\mathbf{X}\beta\] and it is strictly convex if and only if its Hessian \(2\mathbf{X}^{T}\mathbf{X}\) is positive definite, i.e. \(\mathbf{X}^{T}\mathbf{X}\) is invertible.

  • Setting the gradient of \(\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\beta\|^{2}\), namely \(\frac{1}{n}\mathbf{X}^{T}(\mathbf{X}\beta-\mathbf{y})\), to zero gives the normal equations \[\mathbf{X}^{T}\mathbf{X}\beta=\mathbf{X}^{T}\mathbf{y}.\]

  • If \(\mathbf{X}^{T}\mathbf{X}\) is invertible, then the optimal weighting vector is \(\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\)

Linear Regression

  • Differentiating \(-\ell(\beta,\sigma^{2})\) w.r.t. \(\sigma^{2}\), \[\nabla_{\sigma^{2}}\bigl(-\ell(\beta,\sigma^{2})\bigr)=\frac{n}{2\sigma^{2}}-\frac{1}{2\sigma^{4}}\sum_{i=1}^{n}(y_{i}-\beta^{T}\mathbf{x}_{i})^{2},\] and setting this to zero gives \(\hat{\sigma}^{2}=n^{-1}\sum_{i=1}^{n}(y_{i}-\hat{\beta}^{T}\mathbf{x}_{i})^{2}\).
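
A minimal sketch of these closed-form estimators in R; the simulated design and true parameters below are assumed only for illustration:

n = 200; d = 3
X = cbind(1, matrix(rnorm(n * (d - 1)), n, d - 1))  # design matrix with an intercept
beta.true = c(1, 2, -0.5)
y = X %*% beta.true + rnorm(n, sd = 0.7)
beta.hat = solve(t(X) %*% X, t(X) %*% y)            # solve the normal equations
sigma2.hat = sum((y - X %*% beta.hat)^2) / n        # MLE of sigma^2
beta.hat; sigma2.hat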

Logit Function and logistic function

[Figure: the logit and logistic functions.]

Logit Function and logistic function

\[\eta =\mbox{logit}(\theta) = \log \left(\frac{\theta}{1-\theta} \right),\]

\[\theta = \mbox{logit}^{-1}(\eta) = \frac{1}{1 + \exp(-\eta)} = \frac{\exp(\eta)}{ \exp(\eta) + 1}.\]

  • In GLMs, a class of functions \(f\) and their inverses \(f^{-1}\) serve the same purpose, namely linking two types of parameters that live on different domains, such as \(\theta \in (0,1)\) and \(\eta \in (-\infty, \infty)\).

  • \(\theta\) may serve as the mean of \(Y\), while \(\eta\) corresponds to the linear predictor \(\beta^{T}X\).

  • Consider the \(\mbox{logit}^{-1}\) function as \(f(\cdot)\) and \(\mbox{logit}\) as \(f^{-1}(\cdot)=g(\cdot)\), called the link function (see the R check below).
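
In base R, the logit and its inverse are available as qlogis and plogis (a quick check):

eta = qlogis(0.7)       # logit(0.7): maps (0,1) to the real line
plogis(eta)             # inverse logit recovers 0.7
plogis(c(-5, 0, 5))     # always returns values in (0, 1)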

Nonlinear Regression

  • \(X\in\mathbb{R}^{d}\)

  • \(Y\in\lbrace0,1\rbrace\) and \(Y\sim \mbox{Ber}(\theta)\).

  • Logistic function (inverse logit): connect \(\eta\) with \(\theta\in (0,1)\):\[\theta=f(\eta)=\frac{1}{1+e^{-\eta}}\mbox{ for any }\eta\in\mathbb{R}.\]

  • Properties: \(f(-\eta)=1-f(\eta)\) and \(f'(\eta)=f(\eta)(1-f(\eta))\).

  • Given an observation \(\mathbf{x}\), the output \(Y\) conditional on \(X=\mathbf{x}\) follows a Bernoulli distribution with parameter \(\theta=f(\beta^{T}\mathbf{x})\), where \(\beta\) is a \(d\)-dimensional weighting vector.

Model Generation and Simulated Data

  • \(N\) individuals have the same probability \(\pi\) of experiencing some event.

  • The number of counted events (successes) \(Y\) follows a binomial distribution. (A single person: Bernoulli distribution.)

  • Model the number of counted events \(Y_t\) among all individuals \(N_t\) in year \(t\) for a total \(36\) years.

  • The function \(\pi_t=f(\beta^T\mathbf{x}_t)\) links \(\pi_t\) with \(\beta^T\mathbf{x}_t\). Since \(f\) is invertible, \(f^{-1}(\pi_t)=\beta^T\mathbf{x}_t\), i.e. \[\beta^T\mathbf{x}_t = \log\left( \frac{\pi_t}{1-\pi_t} \right)\]

Model Generation and Simulated Data

  • Random counts \[Y_t \sim \mbox{Bin}(N_t, \pi_t)\]

  • Link the parameter \(\pi_t\) with systematic part \(\eta_t =\beta^T\mathbf{x}_t\) by the logit link function: \(\eta_t = \log\left( \frac{\pi_t}{1-\pi_t} \right)\)

  • The system: \[\eta_t =\beta^T\mathbf{x}_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 \in \mathbb{R} .\]

  • Here, \(\pi_t\) is the expected proportion of successes: the mean response for each of the \(N_t\) trials. \(\eta_t\) is that same proportion on the (logit) link scale.

Model Generation and Simulated Data

nyears = 36; beta0 = 0; beta1 = -0.1; beta2 = -0.9
year = 1:nyears;  x = (year-round(nyears/2)) / (nyears / 2)  # centred and scaled covariate
N = round(runif(nyears, min = 20, max=100))  # number of individuals per year
eta = beta0 + beta1 * x + beta2 * x^2        # linear predictor on the logit scale
pi = plogis(eta)                             # inverse-logit link
y = rbinom(nyears, N, pi)                    # observed counts

Try by yourself

x = year
pi = plogis(eta, 0, mean(eta)) 

Model Generation and Simulated Data

plot(year, y/N, type="b"); points(year, pi, type="l", col="red")

Parametric Models

  • A model looks something like \[y= f(x,\beta).\]

  • \(y\): a response, something that our study system has produced and whose genesis we would like to understand

  • The response is a function \(f\) of one or several explanatory variables \(x\) and of some system descriptors, or parameters, \(\beta\).

  • Modeling means replacing a complicated reality of very large dimension with a much smaller set of system descriptors called parameters \(\theta\).

Parametric Models with Parametric Statistical Models

  • A model looks something like \[Y= f(x,\beta) + \varepsilon \mbox{ with } \varepsilon \sim p_{\theta}\]

  • \(\varepsilon\): the part of the response that is not explained by the functional form of the model \(f\) or by the explanatory variables \(x\); it is described by the distribution \(p_{\theta}\) with parameter \(\theta\).

  • \(f\): a function that describes the explained (systematic) part using the parameters \(\beta\).

  • Compact expression: \(Y \sim p(x,\beta,\theta)\)

Parametric Models with Parametric Statistical Models

  • The model means: response \(=\) systematic \(+\) stochastic

  • The previous modeling suggestion is one option.

  • We can consider a hierarchical structure: \[Y \sim p(\eta,\theta), \, \eta \sim p(x, \beta), \mbox{ and } X \sim p \]

  • At the bottom level of the hierarchy, \(x\) is a realization of a random variable distributed as \(p\) without unknowns.

  • At the upper level, \(Y\) is distributed as \(p(\eta, \theta)\) given \(\eta=\beta^{T}x\).

Generalized Linear Models

  • The generalized linear model (GLM) extends the concept of a linear model with a normally distributed \(Y\) to many other response distributions.

  • It includes the Poisson, binomial, Gaussian, multinomial, and many others.

  • Directly connecting non-Gaussian \(Y\) to \(f(x,\beta)\) would typically result in inadmissible values.

  • For example, \(Y\in\{0,1\}\) but \(\beta^T\mathbf{x}\) is continuous.

Generalized Linear Models

  • The big advantage of GLM is that we can apply a linear model \(\eta_i =\beta^T\mathbf{x}_i\) to the response \(Y_i\) indirectly: through a transformation of its parameters, e.g. the mean response \(\theta_i=\mathbb{E}[Y_i]\) for a Poisson variable \(Y_i\).

  • The function that makes the transformation is called the link function.

GLM

  1. The response \(Y_i\) has a statistical distribution \(p\) parameterized by the mean response \(\theta_i\): \[Y_i \sim p(\theta_i).\]

  2. A link function \(g\): \[g(\theta_i) =\eta_i\]

  3. Systematic part of the response is a linear regression:\[ \eta_i = \beta^T \mathbf{x}_i.\]

GLM

GLM is summarized in two lines: \[\begin{align} Y_i \sim & \, p(\theta_i) \\ g(\theta_i) =& \beta^T \mathbf{x}_i \end{align}\]

Link Function
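
The original table is not reproduced here. The canonical links of the families used in this course can be read off the R family objects:

gaussian()$link   # "identity"
poisson()$link    # "log"
binomial()$link   # "logit"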

Poisson GLM

  • Counting the number of produced goods.

  • Each year, the productivity is different.

  • Productivity may increase across the years but may not be strictly monotone.

  • Productivity increases faster at the beginning.

Poisson GLM

  • A differential equation: \[ \frac{d\theta(t)}{dt}= \eta'(t) \theta(t), \mbox{ with } \theta(0)=e^{\eta(0)}.\] The solution is \(\theta(t)= e^{\eta(t)}\).

  • The productivity \(Y_t\) follows \(Y_t \sim \mbox{Poi}(\theta_t)\).

  • Link function: \(\theta_t = e^{\eta_t}\)

  • Cubic polynomial growth: \[\eta_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 + \beta_3 x_t^3.\]

Poisson GLM

nyears = 36; year = 1:nyears
beta0 = 3.5; beta1 = -0.1; beta2 = 0.009; beta3 = -0.00014
eta = beta0 + beta1 * year + beta2 * year^2 + beta3 * year^3  # cubic polynomial (log scale)
theta = exp(eta); y = rpois(nyears, theta)                    # log link and Poisson counts

Poisson GLM

plot(year, y, type="b", ylab="Productivity", xlab="Year"); 
lines(year, theta, type="l", col="red")

Poisson GLM

product = glm(y ~ year + I(year^2) + I(year^3), family=poisson); 
summary(product)$coefficients
##                  Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept)  3.7779415813 1.206720e-01 31.307524 3.686172e-215
## year        -0.1469553144 2.573805e-02 -5.709652  1.132072e-08
## I(year^2)    0.0111125511 1.482214e-03  7.497264  6.516385e-14
## I(year^3)   -0.0001679631 2.460346e-05 -6.826808  8.682462e-12
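
To visualize the fit, one could compare the fitted mean with the data (a sketch; this plotting code is not from the original slides):

plot(year, y, type="b", ylab="Productivity", xlab="Year")
lines(year, predict(product, type="response"), col="blue")  # fitted Poisson mean
lines(year, theta, col="red")                               # true mean used in the simulation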