In an OLS regression model, the dependent variable \(y\) is continuous. That is, the estimated value of \(y\) can take an infinite amount of values. However, in dealing with count data, ordinary least squares resgression does not produce the smallest variance. Count data is when the outcome variable \(y\) are actual counts. That is, \(y = 0, 1, 2, 3, \ldots\). Examples may include:
When OLS fails with count data, a technique called Poisson Regression can be used to model count data. In this example, we will apply a Poisson Regression to determine how a country’s GDP and population impacts the count of Olympic medals throughout history.
For the Poission Regresion model, we assume that the probability distribution of the data is a Poisson distribution and not Gaussian or logistic. If \(Y\) is a Poisson random variable, then its correspondding probability function is
\[ \begin{aligned} f(y) = P(Y=y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, 3, 4, \ldots &&&& (1) \end{aligned} \]
The probability function has only one parameter \(\lambda\), which is the mean and variance of \(Y\). That is to say, \(\mathbf{E}(Y) = \mathbf{var}(Y) = \lambda\). Similar to a regression model, we will try to explain the behavior of \(\mathbf{E}(Y)\) as a function of some explanatory variables. In modeling count data, we achieve this by
\[ \begin{aligned} \mathbb{E}(Y) = \lambda = \exp{(\beta_{0} + \beta_{1}x)} &&&& (2) \end{aligned} \]
Equation (2) defines the Poisson regression model.
The parameters \(\beta_{0}\) and \(\beta_{1}\) can be estimated by maximum likelihood. Let us suppose that we randomly select \(N = 3\) individuals from a population and observe that their counts are \(y_{1} = 0, y_{2} = 2, y_{3} = 2\), where the values \(0, 2, 2\) indicate the occurrences of an event for these three individuals. The likelihood function is the joint probability function of the observed data, which is interpreted as a function of the unknown parameters, as shown below
\[ \begin{aligned} L(\beta_{0}, \beta_{1}) = P(Y = 0) \times P(Y = 2) \times P(Y = 2) \end{aligned} \]
The product of the functions like equation (1) will be very complicated and quite difficult to maximize. However, in practice, maximum likelihood estimation is carried out by maximizing the logartihm of the likelihood function
\[ \begin{aligned} \ln{L(\beta_{0}, \beta_{1})} = \ln{P(Y=0)} + \ln{P(Y=2)} + \ln{P(Y=2)} \end{aligned} \]
Using equation (2) for \(\lambda\), the log of the probability function is
\[ \begin{aligned} \ln{[P(Y=y)]} & = -\lambda + y\ln(\lambda) - \ln{y!} \\ & = -\exp{(\beta_{0}+\beta_{1}x)} + y \times (\beta_{0} +\beta_{1}x) - \ln{(y!)} \end{aligned} \]
The log-likelihood function, given \(N=3\), then becomes
\[ \begin{aligned} \ln{L(\beta_{0}, \beta_{1})} = \sum_{i=1}^{N}\{-\exp{(\beta_{0} + \beta_{1}x)} + y_{i} \times (\beta_{0}+\beta_{1}x_{i}) - \ln{(y!)}\} \end{aligned} \]
Prediction of the conditional mean of \(y\) is fairly straightforward. Given the maximum likelihood estimates \(b_{0}\) and \(b_{1}\), and given the value of the explanatory variable \(x_{0}\),
\[ \begin{aligned} \widehat{\mathbf{E}(y_{0})} = \hat{\lambda}_{0} = \exp{(b_{0} + b_{1}x_{0})} \end{aligned} \]
This value is an estimate of the expected number of cocurrences observed if \(x\) takes the value \(x_{0}\). The probability of a particular number of occurrences can be estimated by inserting the estimated conditional mean into the probability function as
\[ \begin{aligned} \widehat{P(Y=y)} = \frac{\exp{(-\hat{\lambda}_{0})\hat{\lambda}_{0}^{y}}}{y!}, \qquad y = 0,1,2,\ldots \end{aligned} \]
We can compute the marginal effect of \(\mathbf{E}(y_{i})\) with respect to \(x_{i}\) as
\[ \begin{aligned} \frac{\partial \mathbf{E}(y_{i})}{\partial x_{i}} = \lambda_{i}\beta_{1} &&&& (3) \end{aligned} \]
Equation (3) can be expressed as a percentage, which can be usueful
\[ \begin{aligned} \frac{\% \Delta \mathbf{E}(y)}{\Delta x_{i}} = 100 \frac{\partial \mathbf{E}(y_{i}) / \mathbf{E}(y_{i})}{\partial x_{i}} = 100 \beta_{1}\% \end{aligned} \]
USA! USA! Why does the USA smash in every Olympics, winter or suummer? Intuitively, GDP and population should greatly impact the medal count for a country every Olympics, since higher GDP means more resources at your disposal and a larger population means a larger talent pool. Let’s see what, if any, effects GDP and population has on Olympic medal counts.
library(foreign)
## Warning: package 'foreign' was built under R version 3.4.1
olympics <- read.dta('/Users/czar.yobero/Documents/stata/olympics.dta')
olympics <- na.omit(olympics)
kable(head(olympics), format = 'html') %>%
kable_styling(bootstrap_options = c('striped', 'hover'))
| country | year | gdp | pop | gold | silver | bronze | medaltot | host | planned | soviet | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 80 | 2.66e+10 | 16000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 3 | 80 | 3.16e+10 | 18700000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 3 | 84 | 3.85e+10 | 21200000 | 0 | 0 | 2 | 2 | 0 | 0 | 0 |
| 9 | 6 | 80 | 4.99e+09 | 7019000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 7 | 84 | 2.48e+08 | 61800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 8 | 84 | 2.31e+11 | 29900000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Let’s create scatter plots to see if such a positive relationship exists.
library(gridExtra)
gdp.plot <- ggplot(olympics, aes(y = medaltot, x = log(gdp))) + geom_point(col = 'blue') + geom_smooth(col = 'red')
pop.plot <- ggplot(olympics, aes(y = medaltot, x = log(pop))) + geom_point(col = 'blue') + geom_smooth(col = 'tomato2')
grid.arrange(gdp.plot, pop.plot, ncol = 2)
The scatter plots show that there does seem to exist a positive relationship between medal totals and gdp and population.
Let’s take the natural logarithm of GDP and population given their relative large size to make it more linear. Then we will fit our mnodel.
olympics.fit <- glm(medaltot ~ log(gdp) + log(pop), family = poisson(link=log), data = olympics)
summary(olympics.fit)
##
## Call:
## glm(formula = medaltot ~ log(gdp) + log(pop), family = poisson(link = log),
## data = olympics)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.0661 -1.7796 -1.0166 -0.3793 21.6868
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.073819 0.173248 -92.78 <2e-16 ***
## log(gdp) 0.569965 0.008694 65.56 <2e-16 ***
## log(pop) 0.207992 0.011786 17.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 23697.4 on 1252 degrees of freedom
## Residual deviance: 8983.2 on 1250 degrees of freedom
## AIC: 10689
##
## Number of Fisher Scoring iterations: 6
# Predictions
olympics$yhat <- round(olympics.fit$fitted.values, 0)
kable(head(olympics, n = 50), format = 'html') %>%
kable_styling(bootstrap_options = c('striped', 'hover'))
| country | year | gdp | pop | gold | silver | bronze | medaltot | host | planned | soviet | yhat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 80 | 2.66e+10 | 1.600000e+07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 3 | 80 | 3.16e+10 | 1.870000e+07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 7 | 3 | 84 | 3.85e+10 | 2.120000e+07 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 4 |
| 9 | 6 | 80 | 4.99e+09 | 7.019000e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 11 | 7 | 84 | 2.48e+08 | 6.180000e+04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 8 | 84 | 2.31e+11 | 2.990000e+07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| 16 | 11 | 80 | 2.31e+11 | 1.470000e+07 | 2 | 2 | 5 | 9 | 0 | 0 | 0 | 10 |
| 17 | 11 | 84 | 2.58e+11 | 1.550000e+07 | 4 | 8 | 12 | 24 | 0 | 0 | 0 | 10 |
| 18 | 12 | 84 | 1.76e+11 | 7.552000e+06 | 1 | 1 | 1 | 3 | 0 | 0 | 0 | 7 |
| 19 | 12 | 80 | 1.67e+11 | 7.553000e+06 | 1 | 2 | 1 | 4 | 0 | 0 | 0 | 7 |
| 22 | 14 | 84 | 3.06e+09 | 2.290000e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 15 | 84 | 3.93e+09 | 4.073300e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 16 | 84 | 2.38e+10 | 9.560000e+07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 34 | 17 | 84 | 1.51e+09 | 2.526000e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 39 | 19 | 84 | 2.19e+11 | 9.853000e+06 | 1 | 1 | 2 | 4 | 0 | 0 | 0 | 9 |
| 40 | 19 | 80 | 2.14e+11 | 9.847000e+06 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 9 |
| 41 | 20 | 84 | 3.00e+08 | 1.620000e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 45 | 21 | 80 | 1.21e+09 | 3.464000e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46 | 21 | 84 | 1.41e+09 | 3.916360e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 52 | 22 | 84 | 1.92e+09 | 5.560000e+04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54 | 23 | 84 | 1.60e+08 | 5.280800e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 56 | 24 | 84 | 5.01e+09 | 5.780480e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 59 | 26 | 80 | 1.41e+09 | 9.060000e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 60 | 26 | 84 | 2.14e+09 | 1.043830e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 66 | 27 | 84 | 5.06e+11 | 1.330000e+08 | 1 | 5 | 2 | 8 | 0 | 0 | 0 | 24 |
| 67 | 27 | 80 | 5.17e+11 | 1.220000e+08 | 2 | 0 | 2 | 4 | 0 | 0 | 0 | 24 |
| 74 | 30 | 80 | 1.18e+10 | 8.862000e+06 | 8 | 16 | 17 | 41 | 0 | 0 | 1 | 2 |
| 93 | 34 | 80 | 6.32e+09 | 8.655000e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 94 | 34 | 84 | 9.14e+09 | 9.692440e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 96 | 35 | 84 | 4.39e+11 | 2.570000e+07 | 10 | 18 | 16 | 44 | 0 | 0 | 0 | 16 |
| 101 | 37 | 84 | 3.36e+08 | 2.038924e+04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 104 | 38 | 84 | 1.06e+09 | 2.545150e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 109 | 39 | 84 | 8.78e+08 | 4.988890e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 114 | 41 | 84 | 2.40e+11 | 1.040000e+09 | 15 | 8 | 9 | 32 | 0 | 1 | 0 | 24 |
| 121 | 42 | 80 | 4.60e+10 | 2.840000e+07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 122 | 42 | 84 | 4.98e+10 | 3.100000e+07 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 |
| 127 | 44 | 84 | 9.22e+09 | 3.070000e+07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 132 | 45 | 80 | 1.29e+09 | 1.669000e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 133 | 45 | 84 | 2.13e+09 | 1.868080e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 135 | 47 | 80 | 5.68e+09 | 2.284000e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 136 | 47 | 84 | 5.71e+09 | 2.565010e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 138 | 48 | 84 | 8.22e+09 | 9.532520e+06 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 141 | 50 | 80 | 1.68e+10 | 9.710000e+06 | 8 | 7 | 5 | 20 | 0 | 0 | 1 | 2 |
| 144 | 51 | 80 | 3.60e+09 | 6.110000e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 145 | 51 | 84 | 4.53e+09 | 6.400000e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 151 | 53 | 80 | 1.34e+11 | 5.123000e+06 | 2 | 1 | 2 | 5 | 0 | 0 | 0 | 6 |
| 152 | 53 | 84 | 1.46e+11 | 5.112000e+06 | 0 | 3 | 3 | 6 | 0 | 0 | 0 | 6 |
| 153 | 54 | 84 | 4.60e+08 | 3.666700e+05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 158 | 56 | 80 | 7.55e+09 | 5.697000e+06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 159 | 56 | 84 | 8.37e+09 | 6.235340e+06 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
Given the results, it seems that GDP and population in fact play an important factor in a country’s medal count. How do we interpret the results? Recall that the marginal effect of \(x\) is \(100\beta_{1}\%\). This means that a one-unit increase in GDP increases the medal count by 57% change in the conditional mean \(\mathbf{E}(y_{i})\), and a one-unit increase in population increases it by 21%.