Negative Binomial Regression

STATA Examples

Example 1

Version Information: All code in this page is tested on STATA version 14.2.

Build a regression model of Global_Sales as a function of Publisher and Genre.

Data Set Up

Because there are so many small-scale publishers, we limit the analysis to publishers with more than 200 games. To begin, we drop the variables that are not needed for this example.

keep publisher platform global_sales na_sales user_score genre

We want to get rid of rows missing these values. It turns out that only one of these variables has missing values. So we run the following line of code to clean up the data.

drop if mi(user_score)
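As a minimal sketch, a loop like the following, run before the drop above, is one way to confirm which of the kept variables actually contain missing values. The variable list is the one from the keep command; the loop itself is an illustration and not part of the original analysis.

```stata
* Count missing values in each variable kept above.
* missing() works for both string and numeric variables.
foreach v of varlist publisher platform global_sales na_sales user_score genre {
    quietly count if missing(`v')
    display "`v': " r(N) " missing"
}
```

If only user_score reports a nonzero count, dropping on that one variable is enough.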

Now we want to limit the data set to publishers with more than 200 games. To do this we count the number of observations by publisher.

tabulate publisher

The code above outputs a list that is four pages long because the data set has over 500 publishers. However, the vast majority of the publishers have fewer than ten games. For this reason, we shrink our data set to look only at large publishers.

From this list, we can read off the names of the publishers with more than 200 games. These top producing publishers are Activision, Atari, Capcom, Electronic Arts, Konami Digital Entertainment, Namco Bandai Games, Nintendo, Sega, Sony Computer Entertainment, THQ, Take-Two Interactive, Ubisoft, and Warner Bros. Interactive Entertainment. So, we remove all other publishers from our dataset.

keep if publisher == "Activision" | publisher == "Atari" | publisher == "Capcom" | publisher == "Electronic Arts" | publisher == "Konami Digital Entertainment" | publisher == "Namco Bandai Games" | publisher == "Nintendo" | publisher == "Sega" | publisher == "Sony Computer Entertainment" | publisher == "THQ" | publisher == "Take-Two Interactive" | publisher == "Ubisoft" | publisher == "Warner Bros. Interactive Entertainment"

tabulate publisher

We can see that all the publishers listed have more than 200 games, and that the list is now far shorter than the original 500+ publishers.

Now that the data are cleaned up and ready, we can begin the negative binomial regression. STATA is going to complain that our predictor variables Genre and Publisher are strings, when they need to be treated as factors, so we convert them.
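Typing each publisher name by hand is error-prone. As a hedged alternative to the long keep if statement, the same filter could be sketched by counting games per publisher directly. The helper variable n_games is hypothetical and is dropped once it has been used.

```stata
* Count the number of games (rows) per publisher.
* n_games is a hypothetical helper variable, not part of the original data.
bysort publisher: generate n_games = _N

* Keep only publishers with more than 200 games, then discard the helper.
keep if n_games > 200
drop n_games

* List the remaining publishers, most frequent first.
tabulate publisher, sort
```

Because the counts are taken after rows with missing user_score have been dropped, this should select on the same counts that the tabulate output above displays.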

encode publisher, gen(n_publisher)
encode genre, gen(n_genre)
compress

After we create these new factor variable versions, make sure to run compress to save space. Now we can run the negative binomial regression. Here is the one line input:

nbreg global_sales i.n_genre i.n_publisher

The output of this code should return something that looks like this:

Understand the Output

First, let’s explain what we are seeing in the output before we attempt to interpret it. The output first shows us that STATA fits a Poisson model, followed by an intercept-only model, before finally fitting the negative binomial model. The output also shows the log likelihood at each iteration of these fits, even though the intermediate models themselves are not displayed. This way we can see that the fit is indeed getting better with each iteration.

Next, we see how many observations were used to fit the model (5,967 in the case shown above). Below the number of observations is a chi-square statistic with 23 degrees of freedom, one for each estimated coefficient other than the intercept (11 for Genre and 12 for Publisher). It is the test statistic for the null hypothesis that all of these coefficients are equal to zero. In the next row we have the p-value for that test, so it assesses the model as a whole. The next line is a pseudo \(R^{2}\).

The rest of the output is information about the coefficients of the negative binomial model. As an example, let’s look at the row beginning with Simulation. This is the expected difference in log count between Simulation and the reference level (Action) for the Genres in the model. The value of -.220 means that the expected log count for Simulation is .220 lower than the expected log count for Action. The next column tells us the standard error of the coefficient. We then get a z value and its corresponding p-value, which test whether that coefficient is zero. The last two columns give a 95% confidence interval for the coefficient, which conveys the same information: a single level of a variable is considered significant at the 95% confidence level if the confidence interval does not contain 0.
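For reference, the columns in each coefficient row are tied together by the standard large-sample formulas below. Here \(\hat{\beta}\) stands for any one coefficient and \(SE(\hat{\beta})\) for its standard error. The symbols are our own notation, not labels from the STATA output.

\(z = \hat{\beta} / SE(\hat{\beta})\)

\(95\%\ CI = \hat{\beta} \pm 1.96 \times SE(\hat{\beta})\)

The interval excludes 0 exactly when \(|z| > 1.96\), which is why the p-value and the confidence interval lead to the same conclusion.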

Interpret the Output

Now, we can look at the output and draw conclusions. First, we see that the overall p-value of the model is very small, so this model is significant. Also, we can see that for all options under Publisher the coefficients are considered significantly different from zero at the 95% confidence level. We know this since for all of the publishers, zero is not included in the 95% confidence intervals for each of the coefficients. Looking at Genre, most of the coefficients are significantly different from zero at the 95% confidence level. The exceptions are the following: Misc, Platform, Racing, and Role-Playing. These genres have zero contained in the 95% confidence intervals for their coefficients. Note that the reference levels are Activision (for Publisher) and Action (for Genre).
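To check which level serves as the reference, one option is to list the value labels that encode created. This sketch relies on two defaults: encode numbers the levels in alphabetical order, and the i. prefix uses the lowest-numbered level as the base.

```stata
* encode stores the string-to-code mapping in a value label
* named after the new variable.
label list n_genre
label list n_publisher
```

The level coded 1 in each list is the one omitted from the coefficient table, which is why Action and Activision do not have rows of their own.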

The Margins Command

We can use the margins command to better understand the negative binomial regression model. The margins command calculates the predicted counts at each level of Publisher, holding the other variables constant, in this case with all levels of Genre at their respective means.

margins n_publisher, atmeans

When the code above is run, this is the output:

From the output, we can see that the predicted number of events for the publisher Activision is 0.73. It is easy to read from the above table that the publisher with the largest predicted number of events is Nintendo with 2.94, holding all Genre information constant. We run the same code for Genres, holding all the Publishers constant at their means.

margins n_genre, atmeans

From the above output table, it is easy to see that the genre with the largest predicted number of events is Shooter Games with 1.00, holding all publisher information constant. The genre with the smallest predicted number of events is Strategy Games with only 0.36, holding publisher constant.
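The predicted counts can be easier to compare in a plot than in a table. As a sketch, marginsplot graphs the results of the most recent margins command. The horizontal option is our choice here so that the genre names are easier to read.

```stata
* Recompute the predicted counts by genre, then plot them
* with their confidence intervals.
margins n_genre, atmeans
marginsplot, horizontal
```

The same two lines with n_publisher in place of n_genre would give the corresponding plot for publishers.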

The Form of the Negative Binomial Regression Model

The negative binomial regression model equation is similar to that of Poisson regression. In both cases, the log of the expected response is a linear combination of the predictor variables:

\(\log(Global\ Sales) = \hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Adventure) + \hat{\beta_{2}}(Genre = Fighting) + ... + \hat{\beta_{22}}(Publisher = Ubisoft) + \hat{\beta_{23}}(Publisher = Warner Bros. Interactive Entertainment)\)

This implies:

\(Global\ Sales = e^{\hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Adventure) + \hat{\beta_{2}}(Genre = Fighting) + ... + \hat{\beta_{22}}(Publisher = Ubisoft) + \hat{\beta_{23}}(Publisher = Warner Bros. Interactive Entertainment)}\)

\(Global\ Sales = e^{\hat{\beta_{0}}}*e^{\hat{\beta_{1}}(Genre = Adventure)}*e^{\hat{\beta_{2}}(Genre = Fighting)}*...*e^{\hat{\beta_{22}}(Publisher = Ubisoft)}*e^{\hat{\beta_{23}}(Publisher = Warner Bros. Interactive Entertainment)}\)

What this tells us is that on the log scale, the coefficients of the predictors have an additive effect on the value of the response. Once we leave the log scale, the coefficients of the predictors have a multiplicative effect on the response.
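To see the multiplicative effects directly, one possibility is to ask nbreg for exponentiated coefficients with the irr option (incidence-rate ratios). The sketch below also exponentiates the Simulation coefficient quoted earlier (-.220) by hand.

```stata
* Refit the same model, displaying exp(coefficient) instead of the coefficient.
nbreg global_sales i.n_genre i.n_publisher, irr

* Exponentiate the Simulation coefficient by hand; the result is roughly 0.80.
display exp(-0.220)
```

A ratio of about 0.80 means that, holding Publisher fixed, the expected Global_Sales for a Simulation game is about 80% of that for an Action game.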

Example 2

Build a regression model of NA_Sales as a function of Publisher and User_Score.

Data Set Up

Once again, we want to limit our analysis to publishers with over 200 games. We do the same initial data set up as in Example 1, so we skip ahead to the point after publisher and genre have been recoded. We also need to recode User_Score, using the same process that we used for publisher and genre.

encode user_score, gen(n_user_score)
compress

After we create this new factor variable, make sure to run compress again to save space. Now we can run the negative binomial regression. Here is the one line input:

nbreg na_sales i.n_user_score i.n_publisher

The output of this code should return something that looks like this:

Interpret the Output

Now, we can look at the output and draw conclusions. First, we see that the overall p-value of the model is very small, so this model is significant. Also, we can see that for most of the options under Publisher the coefficients are considered significantly different from zero at the 95% confidence level. The exceptions are Electronic Arts, Sony Computer Entertainment, Take-Two Interactive, and Warner Bros. Interactive Entertainment. We know this since for these four publishers, zero is included in the 95% confidence intervals for each of the coefficients.
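Since some publisher levels are individually significant and others are not, it can help to test each factor as a whole. As a sketch, testparm run after the nbreg command above performs a joint Wald test that all coefficients for the named factor are zero.

```stata
* Joint test that all Publisher coefficients are zero.
testparm i.n_publisher

* Joint test that all User_Score coefficients are zero.
testparm i.n_user_score
```

A small p-value for a factor suggests that it matters overall, even when zero lies inside the confidence intervals of some individual levels.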

The Margins Command

margins n_publisher, atmeans

When the code above is run, this is the output:

From the output, we can see that the predicted number of events for the publisher Activision is 0.45. It is easy to read from the above table that the publisher with the largest predicted number of events is Nintendo with 0.86, holding all User_Score information constant. The publisher with the smallest predicted number of events is Konami Digital Entertainment with 0.17.

Addison Shay

December 7, 2017
