Introduction

In an OLS regression model, the dependent variable \(y\) is continuous. That is, the estimated value of \(y\) can take an infinite amount of values. However, in dealing with count data, ordinary least squares resgression does not produce the smallest variance. Count data is when the outcome variable \(y\) are actual counts. That is, \(y = 0, 1, 2, 3, \ldots\). Examples may include:

The number of children in a household.
The number of televisions in a household.
The number of students in a classroom.

When OLS fails with count data, a technique called Poisson Regression can be used to model count data. In this example, we will apply a Poisson Regression to determine how a country’s GDP and population impacts the count of Olympic medals throughout history.

The Poisson Regression Model

For the Poission Regresion model, we assume that the probability distribution of the data is a Poisson distribution and not Gaussian or logistic. If \(Y\) is a Poisson random variable, then its correspondding probability function is

\[ \begin{aligned} f(y) = P(Y=y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, 3, 4, \ldots &&&& (1) \end{aligned} \]

The probability function has only one parameter \(\lambda\), which is the mean and variance of \(Y\). That is to say, \(\mathbf{E}(Y) = \mathbf{var}(Y) = \lambda\). Similar to a regression model, we will try to explain the behavior of \(\mathbf{E}(Y)\) as a function of some explanatory variables. In modeling count data, we achieve this by

\[ \begin{aligned} \mathbb{E}(Y) = \lambda = \exp{(\beta_{0} + \beta_{1}x)} &&&& (2) \end{aligned} \]

Equation (2) defines the Poisson regression model.

Estimating Paramaters Using Maximum Likelihood Estimation

The parameters \(\beta_{0}\) and \(\beta_{1}\) can be estimated by maximum likelihood. Let us suppose that we randomly select \(N = 3\) individuals from a population and observe that their counts are \(y_{1} = 0, y_{2} = 2, y_{3} = 2\), where the values \(0, 2, 2\) indicate the occurrences of an event for these three individuals. The likelihood function is the joint probability function of the observed data, which is interpreted as a function of the unknown parameters, as shown below

\[ \begin{aligned} L(\beta_{0}, \beta_{1}) = P(Y = 0) \times P(Y = 2) \times P(Y = 2) \end{aligned} \]

The product of the functions like equation (1) will be very complicated and quite difficult to maximize. However, in practice, maximum likelihood estimation is carried out by maximizing the logartihm of the likelihood function

\[ \begin{aligned} \ln{L(\beta_{0}, \beta_{1})} = \ln{P(Y=0)} + \ln{P(Y=2)} + \ln{P(Y=2)} \end{aligned} \]

Using equation (2) for \(\lambda\), the log of the probability function is

\[ \begin{aligned} \ln{[P(Y=y)]} & = -\lambda + y\ln(\lambda) - \ln{y!} \\ & = -\exp{(\beta_{0}+\beta_{1}x)} + y \times (\beta_{0} +\beta_{1}x) - \ln{(y!)} \end{aligned} \]

The log-likelihood function, given \(N=3\), then becomes

\[ \begin{aligned} \ln{L(\beta_{0}, \beta_{1})} = \sum_{i=1}^{N}\{-\exp{(\beta_{0} + \beta_{1}x)} + y_{i} \times (\beta_{0}+\beta_{1}x_{i}) - \ln{(y!)}\} \end{aligned} \]

Interpreting the Poission Regression Model

Prediction of the conditional mean of \(y\) is fairly straightforward. Given the maximum likelihood estimates \(b_{0}\) and \(b_{1}\), and given the value of the explanatory variable \(x_{0}\),

\[ \begin{aligned} \widehat{\mathbf{E}(y_{0})} = \hat{\lambda}_{0} = \exp{(b_{0} + b_{1}x_{0})} \end{aligned} \]

This value is an estimate of the expected number of cocurrences observed if \(x\) takes the value \(x_{0}\). The probability of a particular number of occurrences can be estimated by inserting the estimated conditional mean into the probability function as

\[ \begin{aligned} \widehat{P(Y=y)} = \frac{\exp{(-\hat{\lambda}_{0})\hat{\lambda}_{0}^{y}}}{y!}, \qquad y = 0,1,2,\ldots \end{aligned} \]

We can compute the marginal effect of \(\mathbf{E}(y_{i})\) with respect to \(x_{i}\) as

\[ \begin{aligned} \frac{\partial \mathbf{E}(y_{i})}{\partial x_{i}} = \lambda_{i}\beta_{1} &&&& (3) \end{aligned} \]

Equation (3) can be expressed as a percentage, which can be usueful

\[ \begin{aligned} \frac{\% \Delta \mathbf{E}(y)}{\Delta x_{i}} = 100 \frac{\partial \mathbf{E}(y_{i}) / \mathbf{E}(y_{i})}{\partial x_{i}} = 100 \beta_{1}\% \end{aligned} \]

An Example

USA! USA! Why does the USA smash in every Olympics, winter or suummer? Intuitively, GDP and population should greatly impact the medal count for a country every Olympics, since higher GDP means more resources at your disposal and a larger population means a larger talent pool. Let’s see what, if any, effects GDP and population has on Olympic medal counts.

library(foreign)

## Warning: package 'foreign' was built under R version 3.4.1

olympics <- read.dta('/Users/czar.yobero/Documents/stata/olympics.dta')
olympics <- na.omit(olympics)
kable(head(olympics), format = 'html') %>% 
  kable_styling(bootstrap_options = c('striped', 'hover'))

	country	year	gdp	pop	bronze	medaltot
1	1	80	2.66e+10	16000000	0	0
6	3	80	3.16e+10	18700000	0	0
7	3	84	3.85e+10	21200000	2	2
9	6	80	4.99e+09	7019000	0	0
11	7	84	2.48e+08	61800	0	0
13	8	84	2.31e+11	29900000	0	0

Let’s create scatter plots to see if such a positive relationship exists.

library(gridExtra)
gdp.plot <- ggplot(olympics, aes(y = medaltot, x = log(gdp))) + geom_point(col = 'blue') + geom_smooth(col = 'red')
pop.plot <- ggplot(olympics, aes(y = medaltot, x = log(pop))) + geom_point(col = 'blue') + geom_smooth(col = 'tomato2')

grid.arrange(gdp.plot, pop.plot, ncol = 2)

The scatter plots show that there does seem to exist a positive relationship between medal totals and gdp and population.

Let’s take the natural logarithm of GDP and population given their relative large size to make it more linear. Then we will fit our mnodel.

olympics.fit <- glm(medaltot ~ log(gdp) + log(pop), family = poisson(link=log), data = olympics)
summary(olympics.fit)

## 
## Call:
## glm(formula = medaltot ~ log(gdp) + log(pop), family = poisson(link = log), 
##     data = olympics)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.0661   -1.7796   -1.0166   -0.3793   21.6868  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -16.073819   0.173248  -92.78   <2e-16 ***
## log(gdp)      0.569965   0.008694   65.56   <2e-16 ***
## log(pop)      0.207992   0.011786   17.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 23697.4  on 1252  degrees of freedom
## Residual deviance:  8983.2  on 1250  degrees of freedom
## AIC: 10689
## 
## Number of Fisher Scoring iterations: 6

# Predictions
olympics$yhat <- round(olympics.fit$fitted.values, 0)
kable(head(olympics, n = 50), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))

	country	year	gdp	pop	gold	silver	bronze	medaltot	planned	soviet	yhat
1	1	80	2.66e+10	1.600000e+07	0	0	0	0	0	0	3
6	3	80	3.16e+10	1.870000e+07	0	0	0	0	0	0	3
7	3	84	3.85e+10	2.120000e+07	0	0	2	2	0	0	4
9	6	80	4.99e+09	7.019000e+06	0	0	0	0	0	0	1
11	7	84	2.48e+08	6.180000e+04	0	0	0	0	0	0	0
13	8	84	2.31e+11	2.990000e+07	0	0	0	0	0	0	11
16	11	80	2.31e+11	1.470000e+07	2	2	5	9	0	0	10
17	11	84	2.58e+11	1.550000e+07	4	8	12	24	0	0	10
18	12	84	1.76e+11	7.552000e+06	1	1	1	3	0	0	7
19	12	80	1.67e+11	7.553000e+06	1	2	1	4	0	0	7
22	14	84	3.06e+09	2.290000e+05	0	0	0	0	0	0	0
24	15	84	3.93e+09	4.073300e+05	0	0	0	0	0	0	0
27	16	84	2.38e+10	9.560000e+07	0	0	0	0	0	0	4
34	17	84	1.51e+09	2.526000e+05	0	0	0	0	0	0	0
39	19	84	2.19e+11	9.853000e+06	1	1	2	4	0	0	9
40	19	80	2.14e+11	9.847000e+06	1	0	0	1	0	0	9
41	20	84	3.00e+08	1.620000e+05	0	0	0	0	0	0	0
45	21	80	1.21e+09	3.464000e+06	0	0	0	0	0	0	0
46	21	84	1.41e+09	3.916360e+06	0	0	0	0	0	0	0
52	22	84	1.92e+09	5.560000e+04	0	0	0	0	0	0	0
54	23	84	1.60e+08	5.280800e+05	0	0	0	0	0	0	0
56	24	84	5.01e+09	5.780480e+06	0	0	0	0	0	0	1
59	26	80	1.41e+09	9.060000e+05	0	0	0	0	0	0	0
60	26	84	2.14e+09	1.043830e+06	0	0	0	0	0	0	0
66	27	84	5.06e+11	1.330000e+08	1	5	2	8	0	0	24
67	27	80	5.17e+11	1.220000e+08	2	0	2	4	0	0	24
74	30	80	1.18e+10	8.862000e+06	8	16	17	41	0	1	2
93	34	80	6.32e+09	8.655000e+06	0	0	0	0	0	0	1
94	34	84	9.14e+09	9.692440e+06	0	0	0	0	0	0	1
96	35	84	4.39e+11	2.570000e+07	10	18	16	44	0	0	16
101	37	84	3.36e+08	2.038924e+04	0	0	0	0	0	0	0
104	38	84	1.06e+09	2.545150e+06	0	0	0	0	0	0	0
109	39	84	8.78e+08	4.988890e+06	0	0	0	0	0	0	0
114	41	84	2.40e+11	1.040000e+09	15	8	9	32	1	0	24
121	42	80	4.60e+10	2.840000e+07	0	0	0	0	0	0	4
122	42	84	4.98e+10	3.100000e+07	0	1	0	1	0	0	5
127	44	84	9.22e+09	3.070000e+07	0	0	0	0	0	0	2
132	45	80	1.29e+09	1.669000e+06	0	0	0	0	0	0	0
133	45	84	2.13e+09	1.868080e+06	0	0	0	0	0	0	0
135	47	80	5.68e+09	2.284000e+06	0	0	0	0	0	0	1
136	47	84	5.71e+09	2.565010e+06	0	0	0	0	0	0	1
138	48	84	8.22e+09	9.532520e+06	0	1	0	1	0	0	1
141	50	80	1.68e+10	9.710000e+06	8	7	5	20	0	1	2
144	51	80	3.60e+09	6.110000e+05	0	0	0	0	0	0	0
145	51	84	4.53e+09	6.400000e+05	0	0	0	0	0	0	1
151	53	80	1.34e+11	5.123000e+06	2	1	2	5	0	0	6
152	53	84	1.46e+11	5.112000e+06	0	3	3	6	0	0	6
153	54	84	4.60e+08	3.666700e+05	0	0	0	0	0	0	0
158	56	80	7.55e+09	5.697000e+06	0	0	0	0	0	0	1
159	56	84	8.37e+09	6.235340e+06	0	0	1	1	0	0	1

Given the results, it seems that GDP and population in fact play an important factor in a country’s medal count. How do we interpret the results? Recall that the marginal effect of \(x\) is \(100\beta_{1}\%\). This means that a one-unit increase in GDP increases the medal count by 57% change in the conditional mean \(\mathbf{E}(y_{i})\), and a one-unit increase in population increases it by 21%.

Poisson Regression Model: What Impacts a Country’s Olympic Medal Count?

Czar Yobero

1/25/2018

Introduction

The Poisson Regression Model

Estimating Paramaters Using Maximum Likelihood Estimation

Interpreting the Poission Regression Model

An Example