This note shows you how to estimate multinomial logistic regression model.
For this example, I will use titanic package and use Pclass as my dependent variable. Let’s take a look at the dataset using str and summary functions from base R. Recall that I have been using stat.desc function from pastecs package instead of summary as summary doesn’t report number of missing values. However, pastecs output is not really useful for a couple of reasons. First, it defaults to scientific notations so we may need some formatting. Second, it doesn’t report any statistics for character variables. So in the following code I am using some simple code to report number of missing values in each variable.
library(titanic)
library()
d1 <- titanic_train
d1 <- subset(d1, d1$Embarked != "") # delete rows where Embarked didn't have any value
str(d1)
## 'data.frame': 889 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(d1)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:889
## 1st Qu.:224 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446 Median :0.0000 Median :3.000 Mode :character
## Mean :446 Mean :0.3825 Mean :2.312
## 3rd Qu.:668 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:889 Min. : 0.42 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:20.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.0000 Median :0.0000
## Mean :29.64 Mean :0.5242 Mean :0.3825
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.0000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin
## Length:889 Min. : 0.000 Length:889
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 32.097
## 3rd Qu.: 31.000
## Max. :512.329
##
## Embarked
## Length:889
## Class :character
## Mode :character
##
##
##
##
colSums(is.na(d1)) # This reports the total number of missing values for each variable
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Age is the variable that has several missing values. If you recall, one way we could overcome this was by using mean replacement. However, for this example I am going to drop this variable from the model. If I were doing a real analysis, I would include Age because clearly it will have bearing on the passenger class. As an aside, here is the average age of the passengers for each class.
a <- mean(d1[d1$Pclass == 1, c("Age")], na.rm=T)
b <- mean(d1[d1$Pclass == 2, c("Age")], na.rm=T)
c <- mean(d1[d1$Pclass == 3, c("Age")], na.rm=T)
print(paste("The average Age of people on 1st class is", a, "years"))
## [1] "The average Age of people on 1st class is 38.1055434782609 years"
print(paste("The average Age of people on 2nd class is", b, "years"))
## [1] "The average Age of people on 2nd class is 29.8776300578035 years"
print(paste("The average Age of people on 3rd class is", c, "years"))
## [1] "The average Age of people on 3rd class is 25.1406197183099 years"
Clearly Age is discriminating among the classes. People traveing on 1st class are older than the ones on 2nd, who are older than the ones on 3rd class. But I will drop it anyway because my objective is to show you how to estimate a multinomial logistic regression model rather than build a robust model.
I am going to estimate a model that has the following form:
\[P(Pclass) = f(Sex, Embarked, SibSp, Parch) \]
We know that all the variable values are non missing. Remember that Pclass, Sex and Embarked are all categoral variables. In order to use them in the model, we must declare them as such so that we don’t have to worry about creating dummy variables for them. Besides that Pclass is a categorical dependent variable and to estimate a multinomial model, most packages require it to be a factor.
There are many packages in R that can estimate a multinomial logistic regression model. I will use nnet package because of its simplicity. There is a nice tutorial on this topic on the UCLA website.
library(nnet)
d1$Pclass <- factor(d1$Pclass)
d1$Sex <- factor(d1$Sex)
d1$Embarked <- factor(d1$Embarked)
formula <- "Pclass ~ Sex + Embarked + SibSp + Parch"
mlogit <- nnet::multinom(formula, data = d1)
## # weights: 21 (12 variable)
## initial value 976.666325
## iter 10 value 812.978209
## final value 806.250967
## converged
After we run the model, nnet reports the summary iterations and declaration about model convergence. In our case, the model converged quite quickly.
Let’s take a look at the output of the model.
output <- summary(mlogit)
print(output)
## Call:
## nnet::multinom(formula = formula, data = d1)
##
## Coefficients:
## (Intercept) Sexmale EmbarkedQ EmbarkedS SibSp Parch
## 2 -1.5642365 -0.0698696 2.004449 1.876297 -0.07211907 0.04548861
## 3 -0.8142428 0.7559771 3.978519 1.164160 0.19712980 0.11427709
##
## Std. Errors:
## (Intercept) Sexmale EmbarkedQ EmbarkedS SibSp Parch
## 2 0.3006577 0.2238023 0.9523768 0.2922098 0.1343888 0.1464731
## 3 0.2159193 0.1920332 0.7399698 0.1972837 0.1058155 0.1258582
##
## Residual Deviance: 1612.502
## AIC: 1636.502
First thing you will notice is that there are too many parameters (coefficients). In our model we had 4 variables but we now have 5 variables in the output (columns). Interestingly, the names of the first 3 variables Sexmale, EmbarkedQ, and EmbarkedS are unfamiliar to us. We don’t have them in the data!
Recall that we converted Sex and Embarked to factor variables. When we use them in the model, R automativally creates dummy variables for them. As Sex has 2 levels, only 1 dummy was created. female is taken as the reference level and therefore the dummy variable is named Sexmale as it indicates presence or absence of being a male. There are 3 levels in Embarked: C, Q, and S. Therefore we have 2 dummy variables. C is taken as a reference so we have EmbarkedQ and EmbarkedS as two dummy variables for Q and S respectively.
Further, usually we get one set of estimates from the model but here we clearly see two sets in two rows. What’s going on? In logistic regression, one level of the dependent variable is taken as reference and separate model coefficients are estimated on the remaining levels. In our case Pclass has 3 levels (1st, 2nd, and 3rd class). By default the lowest level (Pclass = 1) is taken as a reference.1 Therefore, for the remaining two levels 2 and 3 we get model coefficients. That’s why in the output above you see the rows for Coefficients are marked 2 and 3.
Let’s write these in two equations so you will understand what’s going on.
\[\begin{aligned} ln\bigg(\frac{p(Pclass = 2)}{p(Pclass = 1)}\bigg) = &- 1.5642365 - 0.0698696\cdot Sexmale + 2.004449\cdot EmbarkedQ \\&+ 1.876297\cdot EmbarkedS - 0.07211907\cdot SibSp + 0.04548861\cdot Parch \end{aligned} \]
\[\begin{aligned} ln\bigg(\frac{p(Pclass = 3)}{p(Pclass = 1)}\bigg) = &- 0.8142428 + 0.7559771\cdot Sexmale + 3.978519\cdot EmbarkedQ \\&+ 1.164160\cdot EmbarkedS + 0.19712980\cdot SibSp + 0.11427709\cdot Parch \end{aligned} \]
You may have noticed that multinom doesn’t output Z statistics and p-values. It’s however quite easy to get them. Z statistics are simply ratios of model coefficients and standard errors. Once we get them, we can get the p-values using the standard normal distribution.
z <- output$coefficients/output$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2 # we are using two-tailed z test
Let’s get a table of coefficients, standard errors, z stats, and p values for Pclass = 2
Pclass2 <- rbind(output$coefficients[1,],output$standard.errors[1,],z[1,],p[1,])
rownames(Pclass2) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pclass2)
| (Intercept) | Sexmale | EmbarkedQ | EmbarkedS | SibSp | Parch | |
|---|---|---|---|---|---|---|
| Coefficient | -1.5642365 | -0.0698696 | 2.0044491 | 1.8762971 | -0.0721191 | 0.0454886 |
| Std. Errors | 0.3006577 | 0.2238023 | 0.9523768 | 0.2922098 | 0.1343888 | 0.1464731 |
| z stat | -5.2027160 | -0.3121934 | 2.1046807 | 6.4210604 | -0.5366448 | 0.3105595 |
| p value | 0.0000002 | 0.7548936 | 0.0353191 | 0.0000000 | 0.5915130 | 0.7561355 |
Let’s also get a table of coefficients, standard errors, z stats, and p values for Pclass = 3
Pclass3 <- rbind(output$coefficients[2,],output$standard.errors[2,],z[2,],p[2,])
rownames(Pclass3) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pclass3)
| (Intercept) | Sexmale | EmbarkedQ | EmbarkedS | SibSp | Parch | |
|---|---|---|---|---|---|---|
| Coefficient | -0.8142428 | 0.7559771 | 3.9785191 | 1.1641600 | 0.1971298 | 0.1142771 |
| Std. Errors | 0.2159193 | 0.1920332 | 0.7399698 | 0.1972837 | 0.1058155 | 0.1258582 |
| z stat | -3.7710519 | 3.9366989 | 5.3765966 | 5.9009436 | 1.8629576 | 0.9079831 |
| p value | 0.0001626 | 0.0000826 | 0.0000001 | 0.0000000 | 0.0624682 | 0.3638871 |
You can alter it if you want by using relevel function in base R↩