Negative binomial regression is used to model count outcomes, most often counts that are overdispersed. When modeling count variables, the first approach that comes to mind is usually Poisson regression. However, an important assumption of the Poisson distribution is often neglected: the distribution is parameterized by \(\lambda\), which is both its mean and its variance. In contrast, observed counts usually have a variance that is not equal to their mean. When this happens in a data set, it is not appropriate to assume a Poisson distribution. We describe the data as underdispersed or overdispersed, depending on whether the variance is smaller or larger than the mean, respectively. Fitting a Poisson regression to count data that behave this way results in a model that doesn’t fit well. Negative binomial regression is an appropriate approach when the counts are overdispersed.
The data used in this tutorial come from Kaggle, posted by Rush Kirubi; the data set was motivated by Gregory Smith’s web scrape and extended with another web scrape from Metacritic. There are 16,719 game entries and 16 variables: ten numeric and six categorical. To increase processing speed and save memory, we reduce the data set and keep only the variables of interest.
The data set is rather large and some of the variables are sparsely populated. Below are the basic data cleaning steps. The specifics of how each step is done in each language can be found in the respective tabs below.
For the first regression, we are using the following variables: Global_Sales, Genre, and Publisher. These do not have many missing observations, so the data set is large.
For the second regression, we use NA_Sales, Publisher, and User_Score. User_Score is a very sparse variable, so when we drop missing data points from the data set for this regression, our data set is smaller than the one used for the first regression.
A description of the variables used in the example regressions is given below:
Global_Sales - Total sales across the globe.
NA_Sales - Total sales in North America.
Publisher - Publisher of the game.
Genre - Genre of the game.
User_Score - Score given by Metacritic’s subscribers.
(The data can be accessed at https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings).
Since we will only use Publisher, Global_Sales, NA_Sales, User_Score, and Genre, we create a data set with only these five variables.
data vg_raw;
infile '/home/erickim50/stat 448/Video_Games_Sales.csv' dsd missover dlm = ',' firstobs = 2;
length Name Genre Publisher $50.;
input Name $
Platform $
Release
Genre $
Publisher $
NA_Sales
EU_Sales
JP_Sales
Other_Sales
Global_Sales
Critic_Score
Critic_Count
User_Score
User_Count
Developer $
Rating $;
keep Publisher Global_Sales NA_Sales User_Score Genre;
run;
proc sort data =vg_raw;
by Publisher;
run;
proc freq data= vg_raw nlevels;
table Genre Publisher/ noprint;
run;
When we checked the frequencies of Publisher and Genre (the categorical variables), we found many distinct publishers (581), some of which appear only once or twice. Therefore, we keep only the publishers with more than 400 games and delete all other observations.
data vgsales;
set vg_raw;
if Publisher ne 'Activision'
and Publisher ne 'Electronic Arts'
and Publisher ne 'Konami Digital Entertainment'
and Publisher ne 'Namco Bandai Games'
and Publisher ne 'Nintendo'
and Publisher ne 'Sega'
and Publisher ne 'Sony Computer Entertainment'
and Publisher ne 'THQ'
and Publisher ne 'Take-Two Interactive'
and Publisher ne 'Ubisoft'
then delete;
run;
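The frequency-based filtering above can also be sketched outside SAS. The following is a minimal Python illustration, not part of the original analysis: the helper name keep_frequent and the toy records are hypothetical, and the threshold is passed in rather than fixed at 400.

```python
from collections import Counter

def keep_frequent(rows, key, min_count):
    """Keep only rows whose value for `key` occurs more than `min_count` times."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] > min_count]

# Toy records (hypothetical, not taken from the data set).
games = [
    {"Publisher": "A", "Global_Sales": 1.2},
    {"Publisher": "A", "Global_Sales": 0.4},
    {"Publisher": "A", "Global_Sales": 0.1},
    {"Publisher": "B", "Global_Sales": 2.0},
]

# With a threshold of 2, publisher "A" (3 games) stays and "B" (1 game) is dropped.
kept = keep_frequent(games, "Publisher", 2)
print(sorted({row["Publisher"] for row in kept}))
```

Counting first and filtering second avoids typing the publisher names by hand, which is what the long if statement above has to do.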
Negative binomial regression has an equation similar to that of Poisson regression. The only difference is that the negative binomial model has one additional parameter, which lets the variance differ from the mean. Here Strategy (for Genre) and Ubisoft (for Publisher) are the reference levels, so they have no coefficients of their own:
\[\ln(E(Global\ Sales)) = \hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Action) + \hat{\beta_{2}}(Genre = Adventure) + ... + \hat{\beta_{19}}(Publisher = THQ) + \hat{\beta_{20}}(Publisher = Take-Two Interactive)\]
This implies:
\[E(Global\ Sales) = e^{\hat{\beta_{0}}}*e^{\hat{\beta_{1}}(Genre = Action)}*e^{\hat{\beta_{2}}(Genre = Adventure)}*...*e^{\hat{\beta_{19}}(Publisher = THQ)}*e^{\hat{\beta_{20}}(Publisher = Take-Two Interactive)}\]
– First Model : Global_Sales ~ Publisher, Genre
We do not use User_Score as the response variable, since it does not appear to be overdispersed and it is not a count variable. Because User_Score is not used in this model, we drop User_Score (and NA_Sales) and delete all observations with missing values in the remaining three variables (Global_Sales, Publisher, and Genre).
data vgsales1;
set vgsales;
drop User_Score NA_Sales;
run;
data vg_sale;
set vgsales1;
if cmiss(of _all_) then delete;
run;
Negative Binomial Regression
After cleaning the data, we use proc genmod to fit the negative binomial regression model.
proc genmod data = vg_sale;
class Publisher Genre;
model Global_Sales = Publisher Genre / type1 type3 dist=negbin;
Ods select ParameterEstimates Type1 Type3;
run;
After running the regression, with the last level of each class variable as the reference level (Ubisoft for Publisher and Strategy for Genre, the proc genmod default), we find that some individual levels are not significant. This is acceptable, since most of the levels are significant and each predictor as a whole is significant (Type 1 and Type 3 tables). An insignificant level does not mean that the variable should be dropped from the model.
Looking at the estimates for Publisher, Nintendo has the largest estimate (1.6116) and Namco Bandai Games has the smallest (-0.6385). Because the model uses a log link, these are differences in the log of expected Global_Sales relative to the reference level: holding Genre fixed, expected Global_Sales for a Nintendo game are about \(e^{1.6116} \approx 5.01\) times those for a Ubisoft game. For Genre, Shooter has the largest estimate (1.0943), and interestingly only one genre has a negative estimate: Adventure. However, since the p-value for Adventure is high, we cannot conclude that expected Global_Sales for Adventure games differ from those for Strategy games.
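The SAS estimates quoted above use Ubisoft as the reference publisher, while the R output later in this document uses Activision. As a rough check that the two fits agree, the Python sketch below shifts three R estimates so that Ubisoft becomes the baseline. The helper rereference is hypothetical and the numbers are copied from the R output. The shifted values come out close to the 1.6116 and -0.6385 reported by SAS.

```python
import math

# Estimates copied from the R output in this document (reference levels: Action, Activision).
r_estimates = {
    "Nintendo": 1.33824,
    "Namco Bandai Games": -0.91187,
    "Ubisoft": -0.27338,
}

def rereference(estimates, new_reference):
    """Shift log-scale estimates so that `new_reference` becomes the baseline (estimate 0)."""
    shift = estimates[new_reference]
    return {level: est - shift for level, est in estimates.items()}

vs_ubisoft = rereference(r_estimates, "Ubisoft")
for level, est in vs_ubisoft.items():
    # exp(estimate) is the multiplicative effect on expected sales relative to Ubisoft.
    print(f"{level}: estimate {est:.4f}, ratio {math.exp(est):.2f}")
```

Changing the reference level only shifts the estimates within a factor; the fitted expected values are the same.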
– Second Model : NA_Sales ~ User_Score , Publisher
The second model is used to find out which publishers have the most NA_Sales.
The data used here differ slightly from the data used above because User_Score has many missing values.
Data Cleaning
data vg_sale;
set vgsales;
if cmiss(of _all_) then delete;
run;
First, we use the same publishers as in the first model so that the two regressions can be compared easily. In the second model, we use three variables (User_Score, NA_Sales, and Publisher) and remove observations with missing values, which leaves 4,253 observations.
Now we run proc genmod to get our regression model.
proc genmod data = vg_sale;
class Publisher;
model NA_Sales =User_Score Publisher / type1 type3 dist=negbin;
output out = nb_predi predicted = predi1;
ods select ParameterEstimates Type1 Type3;
run;
All of the terms are significant, and we can use this model to determine which publisher has the most NA_Sales. Using proc sgplot, we can plot the predicted values to see the relationships.
proc sort data = nb_predi;
by predi1;
run;
proc sgplot data = nb_predi;
series x=User_Score y=predi1 / group = Publisher;
run;
The graph indicates that the highest NA_Sales are predicted for Nintendo games, and the lowest for Namco Bandai Games and Sony Computer Entertainment. This is very similar to what we found in our first model, which did not include User_Score: there, Nintendo had the highest estimate and Namco Bandai Games the lowest. As a reference, below is the same graph for Global Sales instead of NA Sales.
On the front page, we introduced the definition of negative binomial regression and the conditions under which it applies. In this document, we apply negative binomial regression to a real-life problem. We use the “Video Game Sales” data set to demonstrate how to fit a negative binomial regression in R and explain how to interpret the model. We also briefly introduce the data cleaning process with the R packages tidyr and dplyr.
In this tutorial, we use the foreign and MASS packages for negative binomial regression and the tidyverse package to clean the data. If these packages are not installed on your device, you can use the command install.packages('package name') to install them and then attach them with library().
setwd("~/Fall 2017/STATS 506/Group Project")
library(foreign)
library(MASS)
library(tidyverse)
# load data
full_data <- read.csv('Video_Games_Sales.csv',header = TRUE)
dim(full_data)
## [1] 16719 16
filter and select are the dplyr functions we use to reduce the records and variables: filter keeps the rows that satisfy a condition, and select keeps the chosen variables. We construct one data set for each regression example: working_data1, with the variables Global_Sales, Publisher, and Genre, and working_data2, with the variables NA_Sales, Publisher, and User_Score. The code below implements the data cleaning in R and displays the first few observations of each data set, in descending order of sales.
# data cleaning
publishers_list = full_data %>%
group_by(Publisher) %>%
count(Publisher) %>%
filter(n>400)
publishers_list
## # A tibble: 10 x 2
## # Groups: Publisher [10]
## Publisher n
## <fctr> <int>
## 1 Activision 985
## 2 Electronic Arts 1356
## 3 Konami Digital Entertainment 834
## 4 Namco Bandai Games 939
## 5 Nintendo 706
## 6 Sega 638
## 7 Sony Computer Entertainment 687
## 8 Take-Two Interactive 422
## 9 THQ 715
## 10 Ubisoft 933
working_data1 = full_data %>%
dplyr::select(Name, Global_Sales, Publisher, Genre) %>%
filter(Publisher %in% publishers_list$Publisher) %>%
arrange(desc(Global_Sales))
head(working_data1)
## Name Global_Sales Publisher Genre
## 1 Wii Sports 82.53 Nintendo Sports
## 2 Super Mario Bros. 40.24 Nintendo Platform
## 3 Mario Kart Wii 35.52 Nintendo Racing
## 4 Wii Sports Resort 32.77 Nintendo Sports
## 5 Pokemon Red/Pokemon Blue 31.37 Nintendo Role-Playing
## 6 Tetris 30.26 Nintendo Puzzle
working_data2 = full_data %>%
dplyr::select(Name, Publisher,NA_Sales,User_Score) %>%
filter(Publisher %in% publishers_list$Publisher,
User_Score != "", User_Score != "tbd") %>%
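# Caution: if User_Score was read as a factor, as.numeric() returns the factor's
# integer level codes (e.g. 79), not the original scores.
mutate(User_Score = as.numeric(User_Score)) %>%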
arrange(desc(NA_Sales))
head(working_data2)
## Name Publisher NA_Sales User_Score
## 1 Wii Sports Nintendo 41.36 79
## 2 Mario Kart Wii Nintendo 15.68 82
## 3 Wii Sports Resort Nintendo 15.61 79
## 4 New Super Mario Bros. Wii Nintendo 14.44 83
## 5 Wii Play Nintendo 13.96 65
## 6 New Super Mario Bros. Nintendo 11.28 84
As explained in the introduction, we examine whether our data satisfy the Poisson assumption of equal mean and variance by looking at descriptive statistics and plots.
# Data visualization.
summary(working_data1$Global_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1000 0.2700 0.7626 0.6750 82.5300
sprintf("Mean and SD = %1.2f and %1.2f", mean(working_data1$Global_Sales), sd(working_data1$Global_Sales))
## [1] "Mean and SD = 0.76 and 2.07"
hist(working_data1$Global_Sales, main = "Histogram of Global Sales")
summary(working_data2$NA_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2100 0.4821 0.4900 41.3600
sprintf("Mean and SD = %1.2f and %1.2f", mean(working_data2$NA_Sales), sd(working_data2$NA_Sales))
## [1] "Mean and SD = 0.48 and 1.12"
hist(working_data2$NA_Sales, main = "Histogram of North America Sales")
These results show that the variance of each sales variable is larger than its mean, that is, the data are overdispersed, so a negative binomial model is appropriate.
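To make the comparison of mean and variance explicit, here is a small Python sketch, separate from the R code of this tab, that squares the reported standard deviations and divides by the means. The helper name dispersion_ratio is hypothetical, and the inputs are the rounded values printed above.

```python
def dispersion_ratio(mean, sd):
    """Variance-to-mean ratio; values well above 1 suggest overdispersion."""
    return sd ** 2 / mean

# Rounded means and standard deviations from the summaries above.
summaries = {
    "Global_Sales": (0.76, 2.07),
    "NA_Sales": (0.48, 1.12),
}

for name, (mean, sd) in summaries.items():
    ratio = dispersion_ratio(mean, sd)
    print(f"{name}: variance = {sd ** 2:.2f}, variance / mean = {ratio:.2f}")
```

Under a Poisson model this ratio would be close to 1; here it is several times larger for both variables.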
The MASS package provides glm.nb, a function designed specifically for negative binomial regression. Next, we demonstrate how to fit a negative binomial regression in R with the first data subset and briefly interpret the fitted model.
# Fit Global_Sales on variables Genre and Publisher.
m1 <- glm.nb(Global_Sales ~ Genre + Publisher, data = working_data1)
summary(m1)
##
## Call:
## glm.nb(formula = Global_Sales ~ Genre + Publisher, data = working_data1,
## init.theta = 1.603242539, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0005 -0.9741 -0.8442 0.2504 8.3455
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.39357 0.05440 -7.235 4.64e-13
## GenreAdventure -0.67505 0.10717 -6.299 3.00e-10
## GenreFighting 0.29456 0.08630 3.413 0.000642
## GenreMisc -0.18253 0.06523 -2.798 0.005139
## GenrePlatform 0.26090 0.06809 3.832 0.000127
## GenrePuzzle -0.46182 0.11476 -4.024 5.71e-05
## GenreRacing 0.10154 0.06901 1.471 0.141160
## GenreRole-Playing 0.08064 0.07298 1.105 0.269210
## GenreShooter 0.45711 0.06117 7.473 7.82e-14
## GenreSimulation -0.11847 0.08311 -1.426 0.153992
## GenreSports -0.03990 0.05461 -0.731 0.464925
## GenreStrategy -0.63718 0.11326 -5.626 1.85e-08
## PublisherElectronic Arts 0.16512 0.06034 2.737 0.006207
## PublisherKonami Digital Entertainment -0.64490 0.08082 -7.979 1.47e-15
## PublisherNamco Bandai Games -0.91187 0.08398 -10.859 < 2e-16
## PublisherNintendo 1.33824 0.06234 21.466 < 2e-16
## PublisherSega -0.47383 0.08321 -5.695 1.24e-08
## PublisherSony Computer Entertainment 0.24237 0.06923 3.501 0.000463
## PublisherTake-Two Interactive 0.32840 0.07839 4.189 2.80e-05
## PublisherTHQ -0.38080 0.07819 -4.870 1.12e-06
## PublisherUbisoft -0.27338 0.07049 -3.878 0.000105
##
## (Intercept) ***
## GenreAdventure ***
## GenreFighting ***
## GenreMisc **
## GenrePlatform ***
## GenrePuzzle ***
## GenreRacing
## GenreRole-Playing
## GenreShooter ***
## GenreSimulation
## GenreSports
## GenreStrategy ***
## PublisherElectronic Arts **
## PublisherKonami Digital Entertainment ***
## PublisherNamco Bandai Games ***
## PublisherNintendo ***
## PublisherSega ***
## PublisherSony Computer Entertainment ***
## PublisherTake-Two Interactive ***
## PublisherTHQ ***
## PublisherUbisoft ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.6032) family taken to be 1)
##
## Null deviance: 10756.4 on 8214 degrees of freedom
## Residual deviance: 8765.4 on 8194 degrees of freedom
## AIC: 18035
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.6032
## Std. Err.: 0.0610
##
## 2 x log-likelihood: -17991.4240
Above is the detailed output that R generates. Since our predictor variables, Genre and Publisher, are categorical, R estimates a coefficient for each level other than the reference level (Action for Genre and Activision for Publisher). For every coefficient, glm.nb reports the estimate, standard error, z-value, and corresponding p-value. The asterisks indicate which coefficients are significant. A significant coefficient means that expected sales for that level differ from expected sales for the reference level. In the output above, every level differs significantly from its reference level in expected Global_Sales except the Racing, Role-Playing, Simulation, and Sports genres.
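The Theta value at the bottom of the output is the extra dispersion parameter. In the parameterization used by glm.nb, the variance is \(\mu + \mu^2/\theta\), so a smaller Theta means more overdispersion. The Python sketch below evaluates this formula; it is an illustration rather than part of the R analysis. The helper name nb2_variance is hypothetical, and the overall mean of Global_Sales is used only as an example value of \(\mu\).

```python
def nb2_variance(mu, theta):
    """Variance of a negative binomial with mean mu and dispersion theta: mu + mu^2 / theta."""
    return mu + mu ** 2 / theta

theta = 1.6032   # Theta from the glm.nb output above
mu = 0.7626      # overall mean of Global_Sales, used only as an illustrative mean

print(f"Poisson variance at this mean:           {mu:.4f}")
print(f"Negative binomial variance at this mean: {nb2_variance(mu, theta):.4f}")
```

As Theta grows very large, the second term vanishes and the variance approaches the Poisson value.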
Next, we are going to interpret the estimated coefficient. According to the traditional negative binomial regression model:
\[\ln\mu = \beta_{0} + \beta_{1} x_1 + \beta_{2} x_2 + ... + \beta_{p} x_p\]
where \(\mu > 0\) is the mean of the dependent variable Y, \(x_1, x_2, ... , x_p\) are the given predictor variables, and \(\beta_{0}, \beta_{1}, \beta_{2}, ..., \beta_{p}\) are the corresponding coefficients. In our regression, this becomes:
\[\ln(E(Global\ Sales)) = \hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Adventure) + \hat{\beta_{2}}(Genre = Fighting) + ... + \hat{\beta_{20}}(Publisher = Ubisoft)\]
This implies:
\[E(Global\ Sales) = e^{\hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Adventure) + \hat{\beta_{2}}(Genre = Fighting) + ... + \hat{\beta_{19}}(Publisher = Take-Two Interactive) + \hat{\beta_{20}}(Publisher = Ubisoft)}\]
Then,
\[E(Global\ Sales) = e^{\hat{\beta_{0}}}*e^{\hat{\beta_{1}}(Genre = Adventure)}*e^{\hat{\beta_{2}}(Genre = Fighting)}*...*e^{\hat{\beta_{19}}(Publisher = Take-Two Interactive)}*e^{\hat{\beta_{20}}(Publisher = Ubisoft)}\]
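As a worked instance of these formulas, the Python sketch below plugs a few estimates from the R output into the equation and confirms that the sum form and the product form give the same prediction. It is an illustration only: the function name expected_sales is hypothetical and only a handful of levels are included.

```python
import math

# Selected estimates copied from the R output above (reference levels: Action, Activision).
intercept = -0.39357
genre = {"Action": 0.0, "Platform": 0.26090, "Shooter": 0.45711}
publisher = {"Activision": 0.0, "Nintendo": 1.33824, "Ubisoft": -0.27338}

def expected_sales(g, p):
    """Expected Global_Sales: exponentiate the linear predictor."""
    return math.exp(intercept + genre[g] + publisher[p])

baseline = expected_sales("Action", "Activision")
nintendo_platform = expected_sales("Platform", "Nintendo")

# The same prediction written as a product of exponentiated terms.
as_product = math.exp(intercept) * math.exp(genre["Platform"]) * math.exp(publisher["Nintendo"])

print(f"Action / Activision: {baseline:.3f}")
print(f"Platform / Nintendo: {nintendo_platform:.3f}")
print(f"Product form agrees: {math.isclose(nintendo_platform, as_product)}")
```

Reference levels contribute a coefficient of zero, so the baseline prediction is just the exponentiated intercept.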
Example 2 follows the same procedure as Example 1, but uses working_data2 as the data set, with the variables NA_Sales, Publisher (the same 10 publishers), and User_Score. Since User_Score has many missing values, this data set is smaller than the first one.
m3 <- glm.nb(NA_Sales~User_Score+Publisher, data = working_data2)
summary(m3)
##
## Call:
## glm.nb(formula = NA_Sales ~ User_Score + Publisher, data = working_data2,
## init.theta = 3.139759402, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5092 -0.9194 -0.8150 0.6229 9.0865
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.467640 0.144516 -10.156
## User_Score 0.013434 0.001898 7.078
## PublisherElectronic Arts -0.204584 0.077186 -2.651
## PublisherKonami Digital Entertainment -1.127191 0.153308 -7.352
## PublisherNamco Bandai Games -1.126434 0.154775 -7.278
## PublisherNintendo 0.642490 0.088038 7.298
## PublisherSega -0.957700 0.137481 -6.966
## PublisherSony Computer Entertainment -0.170673 0.102269 -1.669
## PublisherTake-Two Interactive 0.078311 0.099191 0.790
## PublisherTHQ -0.663601 0.118596 -5.595
## PublisherUbisoft -0.529062 0.096636 -5.475
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## User_Score 1.47e-12 ***
## PublisherElectronic Arts 0.00804 **
## PublisherKonami Digital Entertainment 1.95e-13 ***
## PublisherNamco Bandai Games 3.39e-13 ***
## PublisherNintendo 2.92e-13 ***
## PublisherSega 3.26e-12 ***
## PublisherSony Computer Entertainment 0.09514 .
## PublisherTake-Two Interactive 0.42982
## PublisherTHQ 2.20e-08 ***
## PublisherUbisoft 4.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.1398) family taken to be 1)
##
## Null deviance: 4679.1 on 4252 degrees of freedom
## Residual deviance: 4208.6 on 4242 degrees of freedom
## AIC: 7359.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.140
## Std. Err.: 0.300
##
## 2 x log-likelihood: -7335.680
Above is the detailed output that R generates for the model of North America sales. Looking at the significance indicators, we find that the predictors User_Score and Publisher have a significant influence on NA_Sales, except that the Sony Computer Entertainment and Take-Two Interactive levels do not differ significantly from the reference publisher (Activision).
The fitted model is:
\[E(NA\ Sales) = e^{\hat{\beta_{0}}}*e^{\hat{\beta_{1}}(User\ Score)}*e^{\hat{\beta_{2}}(Publisher = EA)}*...*e^{\hat{\beta_{9}}(Publisher = THQ)}*e^{\hat{\beta_{10}}(Publisher = Ubisoft)}\]
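Because every term in this equation enters through an exponential, each coefficient can be read as a multiplicative factor on expected NA_Sales. The Python sketch below computes a few of these factors from the estimates in the output above. The helper name rate_ratio is hypothetical, and a unit of User_Score here means one unit as the variable is coded in working_data2.

```python
import math

# Estimates copied from the second R model above (reference publisher: Activision).
coef = {
    "User_Score": 0.013434,
    "Nintendo": 0.642490,
    "Konami Digital Entertainment": -1.127191,
}

def rate_ratio(beta, delta=1.0):
    """Multiplicative change in expected NA_Sales for a change of `delta` in the predictor."""
    return math.exp(beta * delta)

print(f"User_Score, +1 unit:    {rate_ratio(coef['User_Score']):.4f}")
print(f"User_Score, +10 units:  {rate_ratio(coef['User_Score'], 10):.4f}")
print(f"Nintendo vs Activision: {rate_ratio(coef['Nintendo']):.3f}")
print(f"Konami vs Activision:   {rate_ratio(coef['Konami Digital Entertainment']):.3f}")
```

A factor above 1 means higher expected sales than the reference, and a factor below 1 means lower expected sales.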
References
Website for our data: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings
UCLA page: https://stats.idre.ucla.edu/r/dae/logit-regression/
Version information: All code on this page was tested on Stata version 14.2.
To begin cleaning our data set, we only keep the variables of interest for the first regression: Global_Sales, Genre, and Publisher.
keep publisher global_sales genre
Our next step is to check for missing values in these variables. It turns out that of these three variables, only Genre has missing values, so we drop those observations with the code below.
drop if mi(genre)
Now we want to limit the data set to publishers with more than 400 games. To do this we count the number of observations by publisher.
tabulate publisher
The code above outputs a list that is four pages long, because the data set has over 500 publishers. However, the vast majority of the publishers have fewer than ten games. For this reason, we shrink our data set to only look at large publishers.
From this list, we can easily read off the names of the publishers with more than 400 games. These top producing publishers are Activision, Electronic Arts, Konami Digital Entertainment, Namco Bandai Games, Nintendo, Sega, Sony Computer Entertainment, THQ, Take-Two Interactive, and Ubisoft. So, we remove all other publishers from our dataset.
keep if publisher == "Activision" | publisher == "Electronic Arts" | publisher == "Konami Digital Entertainment" | publisher == "Namco Bandai Games" | publisher == "Nintendo" | publisher == "Sega" | publisher == "Sony Computer Entertainment" | publisher == "THQ" | publisher == "Take-Two Interactive" | publisher == "Ubisoft"
tabulate publisher
We can see that all the publishers listed have more than 400 games, and that far fewer than the original 500+ publishers remain.
Now that the data are cleaned up and ready, we can begin the negative binomial regression. If we start right away, Stata will complain that our predictor variables Genre and Publisher are strings, when they need to be treated as factors, so we convert them.
encode publisher, gen(n_publisher)
encode genre, gen(n_genre)
compress
After creating these new factor versions of the variables, we run compress to save space. Now we can run the negative binomial regression. Here is the one-line command:
nbreg global_sales i.n_genre i.n_publisher
The output of this code should return something that looks like this:
First, let’s explain what we are seeing in the output before we attempt to interpret it. The output first shows that Stata fits a Poisson model, followed by an intercept-only model, before finally fitting the full negative binomial model. The output also shows the log likelihood at each iteration of these fits, even though the intermediate models themselves are not displayed; this way we can see that the fit is indeed improving with each iteration.
Next, we see how many observations were used to fit the model (8,215 in the case shown above). Below the number of observations is the model chi-square statistic with 20 degrees of freedom, one for each estimated coefficient other than the intercept. In the next row we have its p-value. This is the p-value for a test that all of the estimated coefficients in the model are equal to zero, so it tests the model as a whole. The next line is a pseudo \(R^{2}\).
The rest of the output is information about the coefficients of the negative binomial model. As an example, let’s look at the row beginning with Simulation. The coefficient is the expected difference in log count between Simulation and the reference level (Action) of Genre. The value of -.12 means that the expected log count for Simulation is .12 lower than the expected log count for Action. The next column gives the standard error of the coefficient. We then get a z value and the corresponding p-value. These two columns relate to the last two, which give a 95% confidence interval. A single level of a variable is considered significant at the 5% level if its 95% confidence interval does not contain 0.
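The link between the z value and the confidence interval can be sketched in a few lines of Python. This is an illustration under stated assumptions: the estimates and standard errors are copied from the R output earlier in this document and are assumed to match the Stata fit, the helper name wald_summary is hypothetical, and 1.96 is the usual normal critical value for a 95% interval.

```python
def wald_summary(estimate, std_error, z_crit=1.96):
    """Return the z statistic and an approximate 95% Wald confidence interval."""
    z = estimate / std_error
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return z, (lower, upper)

# Values for two genres, copied from the R output in this document
# (assumed to match the Stata fit, whose output is not reproduced here).
rows = {
    "Simulation": (-0.11847, 0.08311),
    "Shooter": (0.45711, 0.06117),
}

for name, (est, se) in rows.items():
    z, (lo, hi) = wald_summary(est, se)
    significant = not (lo <= 0.0 <= hi)
    print(f"{name}: z = {z:.2f}, 95% CI = ({lo:.3f}, {hi:.3f}), significant = {significant}")
```

The interval for Simulation straddles zero, while the interval for Shooter lies entirely above zero, which matches the significance pattern described in this section.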
Now, we can look at the output and draw conclusions. First, we see that the overall p-value of the model is very small, so this model is significant. Also, we can see that for all of the levels of Publisher the coefficients are significantly different from zero at the 95% confidence level. We know this because, for every publisher, zero is not included in the 95% confidence interval of its coefficient. Looking at Genre, most of the coefficients are significantly different from zero at the 95% confidence level. The exceptions are Racing, Role-Playing, Simulation, and Sports. These genres have zero contained in the 95% confidence intervals of their coefficients. Note that the reference levels are Activision (for Publisher) and Action (for Genre).
We can use the margins command to better understand the negative binomial regression model. The margins command calculates the predicted counts at each level of Publisher, holding the other variables constant, in this case all levels of Genre at their respective means.
margins n_publisher, atmeans
When the code above is run, this is the output:
From the output, we can see that the predicted number of events for the publisher Activision is 0.66. It is easy to read from the above table that the publisher with the largest predicted number of events is Nintendo with 2.53, holding all Genre information constant. We run the same code for Genres, holding all the Publishers constant at their means.
margins n_genre, atmeans
From the above output table, it is easy to see that the genre with the largest predicted number of events is Shooter Games with 0.97, holding all publisher information constant. The genre with the smallest predicted number of events is Adventure Games with only 0.31, holding publisher constant.
The negative binomial regression model equation is similar to that of Poisson regression. In both cases, the log of the expected response is a linear combination of the predictor variables:
\(\ln(E(Global\ Sales)) = \hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Adventure) + \hat{\beta_{2}}(Genre = Fighting) + ... + \hat{\beta_{19}}(Publisher = Take-Two Interactive) + \hat{\beta_{20}}(Publisher = Ubisoft)\)
This implies:
\(E(Global\ Sales) = e^{\hat{\beta_{0}} + \hat{\beta_{1}}(Genre = Adventure) + \hat{\beta_{2}}(Genre = Fighting) + ... + \hat{\beta_{19}}(Publisher = Take-Two Interactive) + \hat{\beta_{20}}(Publisher = Ubisoft)}\)
\(E(Global\ Sales) = e^{\hat{\beta_{0}}}*e^{\hat{\beta_{1}}(Genre = Adventure)}*e^{\hat{\beta_{2}}(Genre = Fighting)}*...*e^{\hat{\beta_{19}}(Publisher = Take-Two Interactive)}*e^{\hat{\beta_{20}}(Publisher = Ubisoft)}\)
What this tells us is that on the log scale, the coefficients of the predictors have an additive effect on the expected response. Once we leave the log scale, the coefficients have a multiplicative effect on the expected response.
To begin cleaning our data set, we only keep the variables of interest for the second regression: NA_Sales, User_Score, and Publisher.
keep publisher na_sales user_score
Our next step is to check for missing values in these variables. It turns out that of these three variables, only User_Score has missing values, but it is missing over six thousand values, which we drop with the code below. This means that this data set is going to be smaller than the data set used in the first regression.
drop if mi(user_score)
Now, we note that some of the User_Score entries are not missing but are recorded as “tbd”. We need to drop these values as well.
drop if user_score == "tbd"
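The two drop steps above treat both empty entries and “tbd” as unusable scores. The same rule can be sketched in Python as a small parser. This is an illustration only: the function name parse_user_score and the sample values are hypothetical.

```python
def parse_user_score(raw):
    """Return the score as a float, or None when it is missing or recorded as 'tbd'."""
    text = raw.strip()
    if text == "" or text.lower() == "tbd":
        return None
    return float(text)

# Hypothetical raw values of the kind described above.
raw_scores = ["8.3", "", "tbd", "6.6"]

parsed = [parse_user_score(s) for s in raw_scores]
usable = [s for s in parsed if s is not None]
print(parsed)
print(usable)
```

Converting the remaining strings to numbers at this point keeps the score on its original numeric scale for later modeling.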
Now we want to limit the data set, so we only look at the publishers that were examined in the first regression. These are Activision, Electronic Arts, Konami Digital Entertainment, Namco Bandai Games, Nintendo, Sega, Sony Computer Entertainment, THQ, Take-Two Interactive, and Ubisoft. We remove all other publishers from our dataset.
keep if publisher == "Activision" | publisher == "Electronic Arts" | publisher == "Konami Digital Entertainment" | publisher == "Namco Bandai Games" | publisher == "Nintendo" | publisher == "Sega" | publisher == "Sony Computer Entertainment" | publisher == "THQ" | publisher == "Take-Two Interactive" | publisher == "Ubisoft"
tabulate publisher
We can see that all the publishers listed are the ones from Example 1, and that far fewer remain than the 500+ publishers in the original data set.
Now, we have a reduced data set. We still have to recode Publisher and User_Score since these two variables need to be treated as factors.
encode publisher, gen(n_publisher)
encode user_score, gen(n_user_score)
compress
After we create these new factor variables, we make sure to run compress to save space. Now we can run the negative binomial regression. Here is the one line input:
nbreg na_sales i.n_user_score i.n_publisher
The output of this code should return something that looks like this:
Now, we can look at the output and draw conclusions. First, we see that the overall p-value of the model is very small, so this model is significant. Also, we can see that for most of the levels of Publisher the coefficients are significantly different from zero at the 95% confidence level. The two exceptions are Sony Computer Entertainment and Take-Two Interactive. We know this because, for these two publishers, zero is included in the 95% confidence intervals of the respective coefficients.
We can use the margins command to better understand the negative binomial regression model. The margins command calculates the predicted counts at each level of Publisher, holding the other variables constant, in this case all levels of User_Score at their respective means.
margins n_publisher, atmeans
When the code above is run, this is the output:
From the output, we can see that the predicted number of events for the publisher Activision is 0.55. It is easy to read from the above table that the publisher with the largest predicted number of events is Nintendo with 1.05, holding all User_Score information constant. The publisher with the smallest predicted number of events is Konami Digital Entertainment with 0.19.