This note shows you how to estimate multinomial logistic regression model.

For this example, I will use titanic package and use Pclass as my dependent variable. Let’s take a look at the dataset using str and summary functions from base R. Recall that I have been using stat.desc function from pastecs package instead of summary as summary doesn’t report number of missing values. However, pastecs output is not really useful for a couple of reasons. First, it defaults to scientific notations so we may need some formatting. Second, it doesn’t report any statistics for character variables. So in the following code I am using some simple code to report number of missing values in each variable.

library(titanic)
library()

d1 <- titanic_train
d1 <- subset(d1, d1$Embarked != "") # delete rows where Embarked didn't have any value
str(d1)
## 'data.frame':    889 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(d1)
##   PassengerId     Survived          Pclass          Name          
##  Min.   :  1   Min.   :0.0000   Min.   :1.000   Length:889        
##  1st Qu.:224   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446   Mean   :0.3825   Mean   :2.312                     
##  3rd Qu.:668   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891   Max.   :1.0000   Max.   :3.000                     
##                                                                   
##      Sex                 Age            SibSp            Parch       
##  Length:889         Min.   : 0.42   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:20.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.0000  
##                     Mean   :29.64   Mean   :0.5242   Mean   :0.3825  
##                     3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :6.0000  
##                     NA's   :177                                      
##     Ticket               Fare            Cabin          
##  Length:889         Min.   :  0.000   Length:889        
##  Class :character   1st Qu.:  7.896   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character  
##                     Mean   : 32.097                     
##                     3rd Qu.: 31.000                     
##                     Max.   :512.329                     
##                                                         
##    Embarked        
##  Length:889        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
colSums(is.na(d1)) # This reports the total number of missing values for each variable
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Age is the variable that has several missing values. If you recall, one way we could overcome this was by using mean replacement. However, for this example I am going to drop this variable from the model. If I were doing a real analysis, I would include Age because clearly it will have bearing on the passenger class. As an aside, here is the average age of the passengers for each class.

a <- mean(d1[d1$Pclass == 1, c("Age")], na.rm=T)
b <- mean(d1[d1$Pclass == 2, c("Age")], na.rm=T)
c <- mean(d1[d1$Pclass == 3, c("Age")], na.rm=T)

print(paste("The average Age of people on 1st class is", a, "years"))
## [1] "The average Age of people on 1st class is 38.1055434782609 years"
print(paste("The average Age of people on 2nd class is", b, "years"))
## [1] "The average Age of people on 2nd class is 29.8776300578035 years"
print(paste("The average Age of people on 3rd class is", c, "years"))
## [1] "The average Age of people on 3rd class is 25.1406197183099 years"

Clearly Age is discriminating among the classes. People traveing on 1st class are older than the ones on 2nd, who are older than the ones on 3rd class. But I will drop it anyway because my objective is to show you how to estimate a multinomial logistic regression model rather than build a robust model.

Multinomial Logistic Regression

I am going to estimate a model that has the following form:

\[P(Pclass) = f(Sex, Embarked, SibSp, Parch) \]

We know that all the variable values are non missing. Remember that Pclass, Sex and Embarked are all categoral variables. In order to use them in the model, we must declare them as such so that we don’t have to worry about creating dummy variables for them. Besides that Pclass is a categorical dependent variable and to estimate a multinomial model, most packages require it to be a factor.

There are many packages in R that can estimate a multinomial logistic regression model. I will use nnet package because of its simplicity. There is a nice tutorial on this topic on the UCLA website.

library(nnet)

d1$Pclass <- factor(d1$Pclass)
d1$Sex <- factor(d1$Sex)
d1$Embarked <- factor(d1$Embarked)

formula <- "Pclass ~ Sex + Embarked + SibSp + Parch"

mlogit <- nnet::multinom(formula, data = d1)
## # weights:  21 (12 variable)
## initial  value 976.666325 
## iter  10 value 812.978209
## final  value 806.250967 
## converged

After we run the model, nnet reports the summary iterations and declaration about model convergence. In our case, the model converged quite quickly.

Let’s take a look at the output of the model.

output <- summary(mlogit)

print(output)
## Call:
## nnet::multinom(formula = formula, data = d1)
## 
## Coefficients:
##   (Intercept)    Sexmale EmbarkedQ EmbarkedS       SibSp      Parch
## 2  -1.5642365 -0.0698696  2.004449  1.876297 -0.07211907 0.04548861
## 3  -0.8142428  0.7559771  3.978519  1.164160  0.19712980 0.11427709
## 
## Std. Errors:
##   (Intercept)   Sexmale EmbarkedQ EmbarkedS     SibSp     Parch
## 2   0.3006577 0.2238023 0.9523768 0.2922098 0.1343888 0.1464731
## 3   0.2159193 0.1920332 0.7399698 0.1972837 0.1058155 0.1258582
## 
## Residual Deviance: 1612.502 
## AIC: 1636.502

First thing you will notice is that there are too many parameters (coefficients). In our model we had 4 variables but we now have 5 variables in the output (columns). Interestingly, the names of the first 3 variables Sexmale, EmbarkedQ, and EmbarkedS are unfamiliar to us. We don’t have them in the data!

Recall that we converted Sex and Embarked to factor variables. When we use them in the model, R automativally creates dummy variables for them. As Sex has 2 levels, only 1 dummy was created. female is taken as the reference level and therefore the dummy variable is named Sexmale as it indicates presence or absence of being a male. There are 3 levels in Embarked: C, Q, and S. Therefore we have 2 dummy variables. C is taken as a reference so we have EmbarkedQ and EmbarkedS as two dummy variables for Q and S respectively.

Writing the probability equations

Further, usually we get one set of estimates from the model but here we clearly see two sets in two rows. What’s going on? In logistic regression, one level of the dependent variable is taken as reference and separate model coefficients are estimated on the remaining levels. In our case Pclass has 3 levels (1st, 2nd, and 3rd class). By default the lowest level (Pclass = 1) is taken as a reference.1 Therefore, for the remaining two levels 2 and 3 we get model coefficients. That’s why in the output above you see the rows for Coefficients are marked 2 and 3.

Let’s write these in two equations so you will understand what’s going on.

\[\begin{aligned} ln\bigg(\frac{p(Pclass = 2)}{p(Pclass = 1)}\bigg) = &- 1.5642365 - 0.0698696\cdot Sexmale + 2.004449\cdot EmbarkedQ \\&+ 1.876297\cdot EmbarkedS - 0.07211907\cdot SibSp + 0.04548861\cdot Parch \end{aligned} \]

\[\begin{aligned} ln\bigg(\frac{p(Pclass = 3)}{p(Pclass = 1)}\bigg) = &- 0.8142428 + 0.7559771\cdot Sexmale + 3.978519\cdot EmbarkedQ \\&+ 1.164160\cdot EmbarkedS + 0.19712980\cdot SibSp + 0.11427709\cdot Parch \end{aligned} \]

Getting the Z-stats and p-values

You may have noticed that multinom doesn’t output Z statistics and p-values. It’s however quite easy to get them. Z statistics are simply ratios of model coefficients and standard errors. Once we get them, we can get the p-values using the standard normal distribution.

z <- output$coefficients/output$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2 # we are using two-tailed z test

Let’s get a table of coefficients, standard errors, z stats, and p values for Pclass = 2

Pclass2 <- rbind(output$coefficients[1,],output$standard.errors[1,],z[1,],p[1,])
rownames(Pclass2) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pclass2)
(Intercept) Sexmale EmbarkedQ EmbarkedS SibSp Parch
Coefficient -1.5642365 -0.0698696 2.0044491 1.8762971 -0.0721191 0.0454886
Std. Errors 0.3006577 0.2238023 0.9523768 0.2922098 0.1343888 0.1464731
z stat -5.2027160 -0.3121934 2.1046807 6.4210604 -0.5366448 0.3105595
p value 0.0000002 0.7548936 0.0353191 0.0000000 0.5915130 0.7561355

Let’s also get a table of coefficients, standard errors, z stats, and p values for Pclass = 3

Pclass3 <- rbind(output$coefficients[2,],output$standard.errors[2,],z[2,],p[2,])
rownames(Pclass3) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pclass3)
(Intercept) Sexmale EmbarkedQ EmbarkedS SibSp Parch
Coefficient -0.8142428 0.7559771 3.9785191 1.1641600 0.1971298 0.1142771
Std. Errors 0.2159193 0.1920332 0.7399698 0.1972837 0.1058155 0.1258582
z stat -3.7710519 3.9366989 5.3765966 5.9009436 1.8629576 0.9079831
p value 0.0001626 0.0000826 0.0000001 0.0000000 0.0624682 0.3638871

  1. You can alter it if you want by using relevel function in base R