Introduction

In an OLS regression model, the dependent variable \(y\) is continuous. That is, the estimated value of \(y\) can take an infinite amount of values. However, in dealing with count data, ordinary least squares resgression does not produce the smallest variance. Count data is when the outcome variable \(y\) are actual counts. That is, \(y = 0, 1, 2, 3, \ldots\). Examples may include:

When OLS fails with count data, a technique called Poisson Regression can be used to model count data. In this example, we will apply a Poisson Regression to determine how a country’s GDP and population impacts the count of Olympic medals throughout history.

The Poisson Regression Model

For the Poission Regresion model, we assume that the probability distribution of the data is a Poisson distribution and not Gaussian or logistic. If \(Y\) is a Poisson random variable, then its correspondding probability function is

\[ \begin{aligned} f(y) = P(Y=y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, 3, 4, \ldots &&&& (1) \end{aligned} \]

The probability function has only one parameter \(\lambda\), which is the mean and variance of \(Y\). That is to say, \(\mathbf{E}(Y) = \mathbf{var}(Y) = \lambda\). Similar to a regression model, we will try to explain the behavior of \(\mathbf{E}(Y)\) as a function of some explanatory variables. In modeling count data, we achieve this by

\[ \begin{aligned} \mathbb{E}(Y) = \lambda = \exp{(\beta_{0} + \beta_{1}x)} &&&& (2) \end{aligned} \]

Equation (2) defines the Poisson regression model.

Estimating Paramaters Using Maximum Likelihood Estimation

The parameters \(\beta_{0}\) and \(\beta_{1}\) can be estimated by maximum likelihood. Let us suppose that we randomly select \(N = 3\) individuals from a population and observe that their counts are \(y_{1} = 0, y_{2} = 2, y_{3} = 2\), where the values \(0, 2, 2\) indicate the occurrences of an event for these three individuals. The likelihood function is the joint probability function of the observed data, which is interpreted as a function of the unknown parameters, as shown below

\[ \begin{aligned} L(\beta_{0}, \beta_{1}) = P(Y = 0) \times P(Y = 2) \times P(Y = 2) \end{aligned} \]

The product of the functions like equation (1) will be very complicated and quite difficult to maximize. However, in practice, maximum likelihood estimation is carried out by maximizing the logartihm of the likelihood function

\[ \begin{aligned} \ln{L(\beta_{0}, \beta_{1})} = \ln{P(Y=0)} + \ln{P(Y=2)} + \ln{P(Y=2)} \end{aligned} \]

Using equation (2) for \(\lambda\), the log of the probability function is

\[ \begin{aligned} \ln{[P(Y=y)]} & = -\lambda + y\ln(\lambda) - \ln{y!} \\ & = -\exp{(\beta_{0}+\beta_{1}x)} + y \times (\beta_{0} +\beta_{1}x) - \ln{(y!)} \end{aligned} \]

The log-likelihood function, given \(N=3\), then becomes

\[ \begin{aligned} \ln{L(\beta_{0}, \beta_{1})} = \sum_{i=1}^{N}\{-\exp{(\beta_{0} + \beta_{1}x)} + y_{i} \times (\beta_{0}+\beta_{1}x_{i}) - \ln{(y!)}\} \end{aligned} \]

Interpreting the Poission Regression Model

Prediction of the conditional mean of \(y\) is fairly straightforward. Given the maximum likelihood estimates \(b_{0}\) and \(b_{1}\), and given the value of the explanatory variable \(x_{0}\),

\[ \begin{aligned} \widehat{\mathbf{E}(y_{0})} = \hat{\lambda}_{0} = \exp{(b_{0} + b_{1}x_{0})} \end{aligned} \]

This value is an estimate of the expected number of cocurrences observed if \(x\) takes the value \(x_{0}\). The probability of a particular number of occurrences can be estimated by inserting the estimated conditional mean into the probability function as

\[ \begin{aligned} \widehat{P(Y=y)} = \frac{\exp{(-\hat{\lambda}_{0})\hat{\lambda}_{0}^{y}}}{y!}, \qquad y = 0,1,2,\ldots \end{aligned} \]

We can compute the marginal effect of \(\mathbf{E}(y_{i})\) with respect to \(x_{i}\) as

\[ \begin{aligned} \frac{\partial \mathbf{E}(y_{i})}{\partial x_{i}} = \lambda_{i}\beta_{1} &&&& (3) \end{aligned} \]

Equation (3) can be expressed as a percentage, which can be usueful

\[ \begin{aligned} \frac{\% \Delta \mathbf{E}(y)}{\Delta x_{i}} = 100 \frac{\partial \mathbf{E}(y_{i}) / \mathbf{E}(y_{i})}{\partial x_{i}} = 100 \beta_{1}\% \end{aligned} \]

An Example

USA! USA! Why does the USA smash in every Olympics, winter or suummer? Intuitively, GDP and population should greatly impact the medal count for a country every Olympics, since higher GDP means more resources at your disposal and a larger population means a larger talent pool. Let’s see what, if any, effects GDP and population has on Olympic medal counts.

library(foreign)
## Warning: package 'foreign' was built under R version 3.4.1
olympics <- read.dta('/Users/czar.yobero/Documents/stata/olympics.dta')
olympics <- na.omit(olympics)
kable(head(olympics), format = 'html') %>% 
  kable_styling(bootstrap_options = c('striped', 'hover'))
country year gdp pop gold silver bronze medaltot host planned soviet
1 1 80 2.66e+10 16000000 0 0 0 0 0 0 0
6 3 80 3.16e+10 18700000 0 0 0 0 0 0 0
7 3 84 3.85e+10 21200000 0 0 2 2 0 0 0
9 6 80 4.99e+09 7019000 0 0 0 0 0 0 0
11 7 84 2.48e+08 61800 0 0 0 0 0 0 0
13 8 84 2.31e+11 29900000 0 0 0 0 0 0 0

Let’s create scatter plots to see if such a positive relationship exists.

library(gridExtra)
gdp.plot <- ggplot(olympics, aes(y = medaltot, x = log(gdp))) + geom_point(col = 'blue') + geom_smooth(col = 'red')
pop.plot <- ggplot(olympics, aes(y = medaltot, x = log(pop))) + geom_point(col = 'blue') + geom_smooth(col = 'tomato2')

grid.arrange(gdp.plot, pop.plot, ncol = 2)

The scatter plots show that there does seem to exist a positive relationship between medal totals and gdp and population.

Let’s take the natural logarithm of GDP and population given their relative large size to make it more linear. Then we will fit our mnodel.

olympics.fit <- glm(medaltot ~ log(gdp) + log(pop), family = poisson(link=log), data = olympics)
summary(olympics.fit)
## 
## Call:
## glm(formula = medaltot ~ log(gdp) + log(pop), family = poisson(link = log), 
##     data = olympics)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.0661   -1.7796   -1.0166   -0.3793   21.6868  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -16.073819   0.173248  -92.78   <2e-16 ***
## log(gdp)      0.569965   0.008694   65.56   <2e-16 ***
## log(pop)      0.207992   0.011786   17.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 23697.4  on 1252  degrees of freedom
## Residual deviance:  8983.2  on 1250  degrees of freedom
## AIC: 10689
## 
## Number of Fisher Scoring iterations: 6
# Predictions
olympics$yhat <- round(olympics.fit$fitted.values, 0)
kable(head(olympics, n = 50), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))
country year gdp pop gold silver bronze medaltot host planned soviet yhat
1 1 80 2.66e+10 1.600000e+07 0 0 0 0 0 0 0 3
6 3 80 3.16e+10 1.870000e+07 0 0 0 0 0 0 0 3
7 3 84 3.85e+10 2.120000e+07 0 0 2 2 0 0 0 4
9 6 80 4.99e+09 7.019000e+06 0 0 0 0 0 0 0 1
11 7 84 2.48e+08 6.180000e+04 0 0 0 0 0 0 0 0
13 8 84 2.31e+11 2.990000e+07 0 0 0 0 0 0 0 11
16 11 80 2.31e+11 1.470000e+07 2 2 5 9 0 0 0 10
17 11 84 2.58e+11 1.550000e+07 4 8 12 24 0 0 0 10
18 12 84 1.76e+11 7.552000e+06 1 1 1 3 0 0 0 7
19 12 80 1.67e+11 7.553000e+06 1 2 1 4 0 0 0 7
22 14 84 3.06e+09 2.290000e+05 0 0 0 0 0 0 0 0
24 15 84 3.93e+09 4.073300e+05 0 0 0 0 0 0 0 0
27 16 84 2.38e+10 9.560000e+07 0 0 0 0 0 0 0 4
34 17 84 1.51e+09 2.526000e+05 0 0 0 0 0 0 0 0
39 19 84 2.19e+11 9.853000e+06 1 1 2 4 0 0 0 9
40 19 80 2.14e+11 9.847000e+06 1 0 0 1 0 0 0 9
41 20 84 3.00e+08 1.620000e+05 0 0 0 0 0 0 0 0
45 21 80 1.21e+09 3.464000e+06 0 0 0 0 0 0 0 0
46 21 84 1.41e+09 3.916360e+06 0 0 0 0 0 0 0 0
52 22 84 1.92e+09 5.560000e+04 0 0 0 0 0 0 0 0
54 23 84 1.60e+08 5.280800e+05 0 0 0 0 0 0 0 0
56 24 84 5.01e+09 5.780480e+06 0 0 0 0 0 0 0 1
59 26 80 1.41e+09 9.060000e+05 0 0 0 0 0 0 0 0
60 26 84 2.14e+09 1.043830e+06 0 0 0 0 0 0 0 0
66 27 84 5.06e+11 1.330000e+08 1 5 2 8 0 0 0 24
67 27 80 5.17e+11 1.220000e+08 2 0 2 4 0 0 0 24
74 30 80 1.18e+10 8.862000e+06 8 16 17 41 0 0 1 2
93 34 80 6.32e+09 8.655000e+06 0 0 0 0 0 0 0 1
94 34 84 9.14e+09 9.692440e+06 0 0 0 0 0 0 0 1
96 35 84 4.39e+11 2.570000e+07 10 18 16 44 0 0 0 16
101 37 84 3.36e+08 2.038924e+04 0 0 0 0 0 0 0 0
104 38 84 1.06e+09 2.545150e+06 0 0 0 0 0 0 0 0
109 39 84 8.78e+08 4.988890e+06 0 0 0 0 0 0 0 0
114 41 84 2.40e+11 1.040000e+09 15 8 9 32 0 1 0 24
121 42 80 4.60e+10 2.840000e+07 0 0 0 0 0 0 0 4
122 42 84 4.98e+10 3.100000e+07 0 1 0 1 0 0 0 5
127 44 84 9.22e+09 3.070000e+07 0 0 0 0 0 0 0 2
132 45 80 1.29e+09 1.669000e+06 0 0 0 0 0 0 0 0
133 45 84 2.13e+09 1.868080e+06 0 0 0 0 0 0 0 0
135 47 80 5.68e+09 2.284000e+06 0 0 0 0 0 0 0 1
136 47 84 5.71e+09 2.565010e+06 0 0 0 0 0 0 0 1
138 48 84 8.22e+09 9.532520e+06 0 1 0 1 0 0 0 1
141 50 80 1.68e+10 9.710000e+06 8 7 5 20 0 0 1 2
144 51 80 3.60e+09 6.110000e+05 0 0 0 0 0 0 0 0
145 51 84 4.53e+09 6.400000e+05 0 0 0 0 0 0 0 1
151 53 80 1.34e+11 5.123000e+06 2 1 2 5 0 0 0 6
152 53 84 1.46e+11 5.112000e+06 0 3 3 6 0 0 0 6
153 54 84 4.60e+08 3.666700e+05 0 0 0 0 0 0 0 0
158 56 80 7.55e+09 5.697000e+06 0 0 0 0 0 0 0 1
159 56 84 8.37e+09 6.235340e+06 0 0 1 1 0 0 0 1

Given the results, it seems that GDP and population in fact play an important factor in a country’s medal count. How do we interpret the results? Recall that the marginal effect of \(x\) is \(100\beta_{1}\%\). This means that a one-unit increase in GDP increases the medal count by 57% change in the conditional mean \(\mathbf{E}(y_{i})\), and a one-unit increase in population increases it by 21%.