05/02/2018

Course Outline

  • Model \(N\), the number of complex events: How many individuals/products/firms are in a certain location? How do they evolve? How do they interact?

  • Data: non-linearity, non-normality, multivariate structure, dynamics.

  • Generalized Linear Models (GLM): Fixed, random and mixed effects. Mixture models. Hierarchical models. Point process and random fields.

  • Applications: Temporal dynamics (population growth) and network/spatial patterns (gravity model, epidemic model, congestion). (Note: many structural econometric models are in fact GLMs.)

Course Information

  • Slides, code, and supplements will be uploaded to Moodle (https://moodleucl.uclouvain.be/) under the name LECON2603.

  • Two presentations and one short report on your presentations. (You also need to comment on the other presentations.)

  • Office hour: Friday 3pm-5pm, or by appointment (email: zhengyuan.gao@uclouvain.be)

  • Place: Voie du Roman Pays 34, Room C215

References

No single reference book covers all the topics, but you may find some of these books useful.

  • Extending the Linear Model with R by Faraway.
  • Multivariate Statistical Modelling Based on Generalized Linear Models by Fahrmeir and Tutz.
  • Generalized Linear Models by McCullagh and Nelder.
  • Statistical Analysis of Network Data with R by Kolaczyk and Csardi.
  • Spatial and Spatio-temporal Bayesian Models with R by Blangiardo and Cameletti.
  • Statistics for Spatio-Temporal Data by Cressie and Wikle.

What's happening in high dimensions?

  • A \(100\times100\) grid.

  • Each node in the grid has two states \(\{\mbox{Yes, No}\}\).

  • We model the state by a discrete random variable.

  • We need \(100^2=10000\) random variables for all nodes.

  • This is simple.

What's happening in high dimensions?

  • If all these nodes influence their neighbours, the problem becomes difficult.

  • We need a network representation for \(100^2\) random variables.

  • Solution: imposing a structure, e.g. dependence only between neighbouring nodes, can reduce the difficulty.

Parametric Statistical Model

  • Complex data: how to represent the data, how to use the models, and how to learn them?

  • Suitable assumptions: design models with the right level of complexity for probabilistic interpretations and tractable computations.

  • A parametric statistical model \(\mathcal{P}_{\Theta}\) is a collection of probability distributions defined on the same space and parameterized by a parameter \(\theta\) belonging to a set \(\Theta\subseteq\mathbb{R}^{d}\), \[\mathcal{P}_{\Theta}=\{p_{\theta}(\cdot)\mid\theta\in\Theta\}.\]

Poisson Distribution

  • The Poisson distribution is a simple model for discrete, non-negative count data \[p_\theta (x) = \frac{e^{-\theta} \theta^x}{x!}.\]

  • \(\theta\): the only parameter, which is also the mean of the distribution. It models the intensity rate or the expected count \(\mbox{Var}[X]=\mathbb{E}[X]=\theta\).

  • Simulation: \(20\) observations with \(\theta = 1\).

rpois(n=20, 1)
##  [1] 2 1 1 1 1 2 0 0 0 0 2 0 0 0 1 0 3 0 1 0

Poisson Distribution

[Figure: Poisson distributions for \(\theta = 1, 5, 20\).]
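
The original figure is not reproduced here. A minimal sketch of how such a comparison could be drawn (the plotting details are assumed):

x = 0:40
plot(x, dpois(x, 1), type="h", xlab="x", ylab="Probability")
points(x + 0.2, dpois(x, 5), type="h", col="red")
points(x + 0.4, dpois(x, 20), type="h", col="blue")
legend("topright", legend=c("theta = 1", "theta = 5", "theta = 20"),
       col=c("black", "red", "blue"), lty=1)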

Poisson Likelihood

likelihood.function = function(x, theta)
{
  logL = dpois(x, theta, log=TRUE) # the log likelihood
  sumlogL = sum(logL)
  return(sumlogL)
}
x = rpois(n=1000, 8)
likelihood.function(x, 10)
## [1] -2656.922
likelihood.function(x, 8)
## [1] -2435.376

MLE

thetas = seq(0, 30, by=0.1)
loglike.vector = numeric(0)
for(i in 1:length(thetas))
{loglike.vector[i] = likelihood.function(x, thetas[i])}
theta.hat = thetas[which.max(loglike.vector)] 
theta.hat
## [1] 8

Likelihood Plot

plot(thetas, loglike.vector, ylab="Log Likelihood")
abline(v=theta.hat)

Bernoulli Distribution

  • Consider a binary random variable \(X\) that can take the value \(0\) or \(1\).

  • If \(\Pr(X=1)\) is parametrized by \(\theta\in[0,1]\): \[\left\{ \begin{array}{ll} \Pr(X=1)=\theta\\ \Pr(X=0)=1-\theta \end{array}\right.\] then a probability distribution of the Bernoulli model: \[\Pr(X=x|\theta)=p_{\theta}(x)=\theta^{x}(1-\theta)^{1-x}\]

  • We can write \(X\sim\mbox{Ber}(\theta)\). The Bernoulli model is the collection of these distributions for \(\theta\in\Theta=[0,1]\).

Binomial Distribution

  • The binomial distribution is a canonical description of the observation process.

  • An event can either happen with a certain probability (\(\pi\)) or not (with probability \(1-\pi\)), and we observe \(n\) trials.

  • A binomial random variable \(\text{Bin}(n,\pi)\) is defined as the value of the sum of \(n\) i.i.d. Bernoulli random variables with parameter \(\pi\).

  • Parametrization: \(\theta = (n,\pi)\) two parameters.

Binomial Distribution

  • Distribution of a binomial random variable \(N\) is \[\Pr(N=k|\theta)=\binom{n}{k}\pi^{k}(1-\pi)^{n-k}.\]

  • The Poisson distribution is a limiting case of the binomial: if the number of trials \(n\) is large and the probability of success \(\pi\) is small, the binomial converges to the Poisson with an expected rate of events \(\theta_{Poi} = n \pi\). (A quick numerical check follows this list.)

  • Mean \(\mathbb{E}[N]= n \pi\) and variance \(\mbox{Var}[N]=n\pi(1-\pi)\)
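
A quick numerical check of this binomial-to-Poisson limit; the values \(n=1000\) and \(\pi=0.005\) below are chosen only for illustration:

n = 1000; pi.small = 0.005          # large n, small pi, n * pi = 5
rbind(dbinom(0:3, n, pi.small),     # binomial probabilities of 0, 1, 2, 3 events
      dpois(0:3, n * pi.small))     # Poisson probabilities with theta = n * pi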

Binomial Distribution

  • Let us assume that each event is independently observed or heard with a constant detection probability of \(0.1\)

  • Draw \(1\) and then \(10\) binomial random numbers with \(n=10\) trials and success probability \(\pi=0.1\).

pi.parameter = 0.1
rbinom(1, 10, pi.parameter)
## [1] 2
rbinom(10, 10, pi.parameter)
##  [1] 1 4 0 0 2 1 0 0 0 1

Binomial Distribution

hist(rbinom(10^6, 10, 0.4), breaks=50, col="gray")

Likelihood

  • \(\mbox{Bin}(1,0.4)=\mbox{Ber}(0.4)\). \[P(X=x|\pi)=\pi^{x}(1-\pi)^{1-x}.\]

  • Observe \(10^2\) times. The likelihood: \[\mbox{Likelihood}(\mathbf{x},0.4)= \prod_{i=1}^{100} P(X=x_i \mid 0.4) = \prod_{i=1}^{100} 0.4^{x_i}(1-0.4)^{1-x_i}.\]

pi.parameter = 0.4; 
sequence = rbinom(10^2, 1, pi.parameter)

Likelihood

likelihood = function(observation, pi.parameter)
{
  # naive likelihood: multiply the individual Bernoulli probabilities
  likelihood = 1
  for (i in 1:length(observation))
  {
    if (observation[i] == 1)
    {likelihood = likelihood * pi.parameter}
    else
    {likelihood = likelihood * (1 - pi.parameter)}
  }
  return(likelihood)
}

Likelihood Plot

possible.pi = seq(0, 1, by = 0.001)
plot(possible.pi, sapply(possible.pi, 
          function (pi) {likelihood(sequence, pi)}), ylab ="")

MLE

mle.results = optimize(function(pi) {likelihood(sequence, pi)},
                       interval = c(0, 1), maximum = TRUE)
mle.results
## $maximum
## [1] 0.449986
## 
## $objective
## [1] 1.3017e-30

  • Try sequence = rbinom(1500, 1, pi.parameter). Floating point: the likelihood of the data becomes numerically indistinguishable from \(0\). Multiplying thousands of probabilities together only makes things worse.

Log-likelihood
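
The original code for this slide is not shown here. A minimal sketch of a log-likelihood version that avoids the underflow, using dbinom(..., log=TRUE):

log.likelihood = function(observation, pi.parameter)
{
  # sum the log-probabilities instead of multiplying the probabilities
  sum(dbinom(observation, size=1, prob=pi.parameter, log=TRUE))
}
sequence = rbinom(1500, 1, 0.4)
optimize(function(pi) {log.likelihood(sequence, pi)},
         interval = c(0, 1), maximum = TRUE)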

Numerical Trap

Multinomial distribution

  • A discrete random variable that can take one of \(K\) possible values \(\{1,2,\ldots,K\}\)

  • It can be represented by a \(K\)-dimensional binary random variable \[X=(X_{1},X_{2},\ldots,X_{K})^{T}\] where \[\{k\mbox{ value is chosen}\}=\{X_k=1 \text{ and }X_{l}=0,\forall l\neq k\}\]

Multinomial distribution

  • Parametrize \(\Pr\{k\mbox{ value is chosen}\}\) by \(\pi_{k}\in[0,1]\) so \[\Pr(X_{k}=1)=\pi_{k}\ \ \forall k=1,2,\ldots,K,\,\mbox{and }\sum_{k=1}^{K}\pi_{k}=1.\]

  • The probability distribution over \(\mathbf{x}=(x_{1},\ldots,x_{K})\): \[p(\mathbf{x}|\pi,\pi^{T}\mathbf{1}=1)=p_{\theta}(\mathbf{x})=\prod_{k=1}^{K}\pi_{k}^{x_{k}}\] where \(\theta\in \Theta=\left\{ \pi\mid\sum_{k=1}^{K}\pi_{k}=1,\pi_{k}\in[0,1]\mbox{ for all }k\right\}\) and \(\pi=(\pi_{1},\ldots,\pi_{K})^{T}\). (An R illustration follows.)
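
For illustration, multinomial draws in R; the probability vector below is assumed:

pi.vector = c(0.2, 0.5, 0.3)             # K = 3 categories
rmultinom(1, size=1, prob=pi.vector)     # one draw: a one-hot vector x
rmultinom(1, size=100, prob=pi.vector)   # counts over 100 independent trials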

Gaussian Distribution

  • A scalar variable \(X\) is Gaussian distributed: \(X\sim \mathcal{N}(x|\mu,\sigma^{2})\)

  • \(\mu\) is the mean and \(\sigma^{2}\) is the variance. The density function: \[(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)\]

  • For a \(d\)-dimensional vector \(\mathbf{x}\), \[\mathcal{N}(\mathbf{x}|\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right)\] where \(\mu\) is a \(d\)-dimensional vector, \(\Sigma\) is a \(d\times d\) symmetric positive definite matrix, and \(|\Sigma|\) denotes the determinant of \(\Sigma\).

Gaussian Distribution

library(mvtnorm)
vector.mean = c(0,0); cov.matrix = matrix(c(1,0.8,0.8,1), 2,2) 
cov.matrix
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0

Gaussian Distribution

y = rmvnorm(n=30, mean=vector.mean, sigma=cov.matrix); plot(y)

Linear Regression from Probabilistic Perspective

  • A causal relation as a network with two nodes: one node corresponds to an input \(X\), and one node corresponds to an output \(Y\).

  • Model: joint distribution \(p(X,Y)\)

  • Conditional model: models the conditional probability of the output, given the input \[p(Y|X)\]

  • The linear regression models are all conditional models.

Linear Regression

  • \(Y\in\mathbb{R}\) depends linearly on \(X\in\mathbb{R}^{d}\).

  • \(\theta = (\beta,\sigma)\): \(\beta\in\mathbb{R}^{d}\) be a weighting vector, \(\sigma^{2}>0\).

  • Parametric assumption: \[Y\mid X\sim\mathcal{N}(\beta^{T}X,\sigma^{2}),\] which is equivalent to \[Y=\beta^{T}X+\varepsilon,\mbox{ with }\varepsilon\sim\mathcal{N}(0,\sigma^{2}).\]

  • Data set: \(\mathcal{D}=\lbrace(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{n},y_{n})\rbrace\) of i.i.d. random variables.

Linear Regression

  • Likelihood: \[p(y_{1},\cdots,y_{n}|\mathbf{x}_{1},\cdots,\mathbf{x}_{n},\beta,\sigma^{2})=\prod_{i=1}^{n}p(y_{i}|\mathbf{x}_{i},\beta,\sigma^{2}).\]

  • Log-likelihood:\[ -\ell(\beta,\sigma^{2})=-\sum_{i=1}^{n}\log p(y_{i}|\mathbf{x}_{i})=\frac{n}{2}\log(2\pi\sigma^{2})+\frac{1}{2}\sum_{i=1}^{n}\frac{(y_{i}-\beta^{T}\mathbf{x}_{i})^{2}}{\sigma^{2}}.\]

  • A compact expression: maximizing the log-likelihood over \(\beta\) is equivalent to \[\hat{\beta}=\arg\min_{\beta}\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\beta\|^{2},\mbox{ where }\mathbf{X}=\begin{pmatrix}\mathbf{x}_{1}^{T}\\ \vdots\\ \mathbf{x}_{n}^{T} \end{pmatrix}\in\mathbb{R}^{n\times d}.\]

Linear Regression

  • The squared norm \(\|\mathbf{y}-\mathbf{X}\beta\|^{2}\) equals \[\mathbf{y}^{T}\mathbf{y}-2\beta^{T}\mathbf{X}^{T}\mathbf{y}+\beta^{T}\mathbf{X}^{T}\mathbf{X}\beta\] and it is strictly convex if and only if its Hessian \(2\mathbf{X}^{T}\mathbf{X}\) is positive definite, i.e. \(\mathbf{X}^{T}\mathbf{X}\) is invertible.

  • Setting the gradient of \(\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\beta\|^{2}\), namely \(\frac{1}{n}\mathbf{X}^{T}(\mathbf{X}\beta-\mathbf{y})\), to zero gives the normal equations \[\mathbf{X}^{T}\mathbf{X}\beta=\mathbf{X}^{T}\mathbf{y}.\]

  • If \(\mathbf{X}^{T}\mathbf{X}\) is invertible, then the optimal weighting vector is \(\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\)

Linear Regression

  • Differentiating \(-\ell(\beta,\sigma^{2})\) w.r.t. \(\sigma^{2}\), \[\nabla_{\sigma^{2}}\bigl(-\ell(\beta,\sigma^{2})\bigr)=\frac{n}{2\sigma^{2}}-\frac{1}{2\sigma^{4}}\sum_{i=1}^{n}(y_{i}-\beta^{T}\mathbf{x}_{i})^{2},\] and setting this to zero gives \(\hat{\sigma}^{2}=n^{-1}\sum_{i=1}^{n}(y_{i}-\hat{\beta}^{T}\mathbf{x}_{i})^{2}\).
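
A minimal sketch of these closed-form estimators in R; the simulated design and true parameters below are assumed only for illustration:

n = 200; d = 3
X = cbind(1, matrix(rnorm(n * (d - 1)), n, d - 1))  # design matrix with an intercept
beta.true = c(1, 2, -0.5)
y = X %*% beta.true + rnorm(n, sd = 0.7)
beta.hat = solve(t(X) %*% X, t(X) %*% y)            # solve the normal equations
sigma2.hat = sum((y - X %*% beta.hat)^2) / n        # MLE of sigma^2
beta.hat; sigma2.hat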

Logit Function and logistic function

[Figure: the logit and logistic functions.]

Logit Function and logistic function

\[\eta =\mbox{logit}(\theta) = \log \left(\frac{\theta}{1-\theta} \right),\]

\[\theta = \mbox{logit}^{-1}(\eta) = \frac{1}{1 + \exp(-\eta)} = \frac{\exp(\eta)}{ \exp(\eta) + 1}.\]

  • In GLMs, a class of functions \(f\) and their inverses \(f^{-1}\) serve the same purpose, namely linking two types of parameters that live on different domains, such as \(\theta \in (0,1)\) and \(\eta \in (-\infty, \infty)\).

  • \(\theta\) may serve as the mean of \(Y\), while \(\eta\) corresponds to the linear predictor \(\beta^{T}X\).

  • Consider the \(\mbox{logit}^{-1}\) function as \(f(\cdot)\) and \(\mbox{logit}\) as \(f^{-1}(\cdot)=g(\cdot)\), called the link function (see the R check below).
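
In base R, the logit and its inverse are available as qlogis and plogis (a quick check):

eta = qlogis(0.7)       # logit(0.7): maps (0,1) to the real line
plogis(eta)             # inverse logit recovers 0.7
plogis(c(-5, 0, 5))     # always returns values in (0, 1)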

Nonlinear Regression

  • \(X\in\mathbb{R}^{d}\)

  • \(Y\in\lbrace0,1\rbrace\) and \(Y\sim \mbox{Ber}(\theta)\).

  • Logistic function (inverse logit): connect \(\eta\) with \(\theta\in (0,1)\):\[\theta=f(\eta)=\frac{1}{1+e^{-\eta}}\mbox{ for any }\eta\in\mathbb{R}.\]

  • Properties: \(f(-\eta)=1-f(\eta)\) and \(f'(\eta)=f(\eta)(1-f(\eta))\).

  • Given an observation \(\mathbf{x}\), the output \(Y\) conditional on \(X=\mathbf{x}\) follows a Bernoulli distribution with parameter \(\theta=f(\beta^{T}\mathbf{x})\), where \(\beta\) is a \(d\)-dimensional weighting vector.

Model Generation and Simulated Data

  • \(N\) individuals have the same probability \(\pi\) of experiencing some event.

  • The number of counted events (successes) \(Y\) follows a binomial distribution. (A single person: Bernoulli distribution.)

  • Model the number of counted events \(Y_t\) among all individuals \(N_t\) in year \(t\) for a total \(36\) years.

  • The function \(\pi_t=f(\beta^T\mathbf{x}_t)\) links \(\pi_t\) with \(\beta^T\mathbf{x}_t\). Since \(f\) is invertible, \(f^{-1}(\pi_t)=\beta^T\mathbf{x}_t\), i.e. \[\beta^T\mathbf{x}_t = \log\left( \frac{\pi_t}{1-\pi_t} \right)\]

Model Generation and Simulated Data

  • Random counts \[Y_t \sim \mbox{Bin}(N_t, \pi_t)\]

  • Link the parameter \(\pi_t\) with systematic part \(\eta_t =\beta^T\mathbf{x}_t\) by the logit link function: \(\eta_t = \log\left( \frac{\pi_t}{1-\pi_t} \right)\)

  • The system: \[\eta_t =\beta^T\mathbf{x}_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 \in \mathbb{R} .\]

  • Here, \(\pi_t\) is the expected proportion of successes: the mean response for each of the \(N_t\) trials. \(\eta_t\) is that same proportion on the (logit) link scale.

Model Generation and Simulated Data

nyears = 36; beta0 = 0; beta1 = -0.1; beta2 = -0.9
year = 1:nyears;  x = (year-round(nyears/2)) / (nyears / 2)  # centred and scaled covariate
N = round(runif(nyears, min = 20, max=100))  # number of individuals per year
eta = beta0 + beta1 * x + beta2 * x^2        # linear predictor on the logit scale
pi = plogis(eta)                             # inverse-logit link
y = rbinom(nyears, N, pi)                    # observed counts

Try by yourself

x = year
pi = plogis(eta, 0, mean(eta)) 

Model Generation and Simulated Data

plot(year, y/N, type="b"); points(year, pi, type="l", col="red")

Parametric Models

  • A model looks something like \[y= f(x,\beta).\]

  • \(y\): a response, something that our study system has produced and whose genesis we would like to understand

  • The response is a function \(f\) of one or several explanatory variables \(x\) and of some system descriptors, or parameters, \(\beta\).

  • Modeling means replacing a complicated reality of very large dimension with a much smaller set of system descriptors called parameters \(\theta\).

Parametric Models with Parametric Statistical Models

  • A model looks something like \[Y= f(x,\beta) + \varepsilon \mbox{ with } \varepsilon \sim p_{\theta}\]

  • \(\varepsilon\): the part of the response that is not explained by the functional form of the model \(f\) or by the explanatory variables \(x\); it is described by the distribution \(p_{\theta}\) with parameter \(\theta\).

  • \(f\): a function that describes the explained (systematic) part using the parameters \(\beta\).

  • Compact expression: \(Y \sim p(x,\beta,\theta)\)

Parametric Models with Parametric Statistical Models

  • The model means: response \(=\) systematic \(+\) stochastic

  • The previous modeling suggestion is one option.

  • We can consider a hierarchical structure: \[Y \sim p(\eta,\theta), \, \eta \sim p(x, \beta), \mbox{ and } X \sim p \]

  • At the bottom level of the hierarchy, \(x\) is a realization of a random variable distributed as \(p\) without unknowns.

  • At the upper level, \(Y\) is distributed as \(p(\eta, \theta)\) given \(\eta=\beta^{T}x\).

Generalized Linear Models

  • The generalized linear model (GLM) extends the concept of a linear model with a normally distributed \(Y\) to many other response distributions.

  • It includes the Poisson, binomial, Gaussian, multinomial, and many others.

  • Directly connecting non-Gaussian \(Y\) to \(f(x,\beta)\) would typically result in inadmissible values.

  • For example, \(Y\in\{0,1\}\) but \(\beta^T\mathbf{x}\) is continuous.

Generalized Linear Models

  • The big advantage of GLM is that we can apply a linear model \(\eta_i =\beta^T\mathbf{x}_i\) to the response \(Y_i\) indirectly: through a transformation of its parameters, e.g. the mean response \(\theta_i=\mathbb{E}[Y_i]\) for a Poisson variable \(Y_i\).

  • The function that makes the transformation is called the link function.

GLM

  1. The response \(Y_i\) has a statistical distribution \(p\) parameterized by the mean response \(\theta_i\): \[Y_i \sim p(\theta_i).\]

  2. A link function \(g\): \[g(\theta_i) =\eta_i\]

  3. Systematic part of the response is a linear regression:\[ \eta_i = \beta^T \mathbf{x}_i.\]

GLM

GLM is summarized in two lines: \[\begin{align} Y_i \sim & \, p(\theta_i) \\ g(\theta_i) =& \beta^T \mathbf{x}_i \end{align}\]

Link Function
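
The original table is not reproduced here. The canonical links of the families used in this course can be read off the R family objects:

gaussian()$link   # "identity"
poisson()$link    # "log"
binomial()$link   # "logit"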

Poisson GLM

  • Counting the number of produced goods.

  • Each year, the productivity is different.

  • Productivity may increase across the years but may not be strictly monotone.

  • Productivity increases faster at the beginning.

Poisson GLM

  • A differential equation: \[ \frac{d\theta(t)}{dt}= \eta'(t) \theta(t), \mbox{ with } \theta(0)=e^{\eta(0)}.\] The solution is \(\theta(t)= e^{\eta(t)}\).

  • The productivity \(Y_t\) follows \(Y_t \sim \mbox{Poi}(\theta_t)\).

  • Link function: \(\theta_t = e^{\eta_t}\)

  • Cubic polynomial growth: \[\eta_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 + \beta_3 x_t^3.\]

Poisson GLM

nyears = 36; year = 1:nyears
beta0 = 3.5; beta1 = -0.1; beta2 = 0.009; beta3 = -0.00014
eta = beta0 + beta1 * year + beta2 * year^2 + beta3 * year^3  # cubic polynomial (log scale)
theta = exp(eta); y = rpois(nyears, theta)                    # log link and Poisson counts

Poisson GLM

plot(year, y, type="b", ylab="Productivity", xlab="Year"); 
lines(year, theta, type="l", col="red")

Poisson GLM

product = glm(y ~ year + I(year^2) + I(year^3), family=poisson); 
summary(product)$coefficients
##                  Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept)  3.7779415813 1.206720e-01 31.307524 3.686172e-215
## year        -0.1469553144 2.573805e-02 -5.709652  1.132072e-08
## I(year^2)    0.0111125511 1.482214e-03  7.497264  6.516385e-14
## I(year^3)   -0.0001679631 2.460346e-05 -6.826808  8.682462e-12
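
To visualize the fit, one could compare the fitted mean with the data (a sketch; this plotting code is not from the original slides):

plot(year, y, type="b", ylab="Productivity", xlab="Year")
lines(year, predict(product, type="response"), col="blue")  # fitted Poisson mean
lines(year, theta, col="red")                               # true mean used in the simulation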