Overview
Multinomial logistic regression is used when the dependent variable has more than two categories. It is an extension of the basic (binary) logistic regression model. For example, if the dependent variable has three categories, A, B and C, and A is chosen as the reference category, we would estimate regression models of B vs A and C vs A. It is important to note that these models are estimated simultaneously.
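In other words, for a three-category outcome with A as the reference category, the model estimates two logit equations at once. A sketch of the general form, with a single predictor x, is:

$$\log\frac{P(Y=B)}{P(Y=A)} = \beta_{0B} + \beta_{1B}x \qquad\qquad \log\frac{P(Y=C)}{P(Y=A)} = \beta_{0C} + \beta_{1C}x$$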
This model is used to predict the probabilities of a categorical dependent variable that has more than two outcome classes. The dependent variable is nominal in nature, meaning that there is no ordering of the target classes, i.e. the classes cannot be meaningfully ranked (e.g. 1, 2, 3, 4). Examples of multinomial logistic regression topics include:
Favorite ice cream flavor
Current employment status
Sentencing outcome status
Assumptions:
Multinomial logistic regression assumes that:
1 Each independent variable has a single value for each case.
2 The dependent variable cannot be perfectly predicted from the independent variables for each case.
3 No multicollinearity between independent variables - this occurs when two or more independent variables are highly correlated with each other, which makes it difficult to understand how much each independent variable contributes to the category of the dependent variable (see the sketch after this list).
4 There should be no outliers in the data points.
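As a minimal sketch of how assumption 3 might be checked, the code below computes pairwise correlations for a hypothetical data frame of numeric predictors (the predictor names here are made up for illustration only):

## Hypothetical numeric predictors - correlations close to +1 or -1 would signal multicollinearity
set.seed(1)
preds <- data.frame(age = rnorm(100, mean = 9, sd = 4),
                    n_referrals = rnorm(100, mean = 2, sd = 1))
cor(preds)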
Step 1: Installing the packages
The following packages are required: sjmisc and sjPlot. To create the multinomial models the nnet package needs to be installed. The wakefield package is also used below to generate the sample data.
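If any of these packages are not already available, they can be installed first (a one-off step, assuming installation from CRAN):

## Install the required packages (only needed once)
install.packages(c("sjmisc", "sjPlot", "nnet", "wakefield"))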
library(sjmisc)
library(sjPlot)
library(nnet)
library(wakefield)
Step 2: Creating a data frame
For the purposes of this exercise the wakefield package will be used to generate a random data set. This will be a pseudo data set resembling the number of children in a local authority who are either: a Child in Need (CIN), subject to a Child Protection Plan (CPP), or a Child Looked After (CLA).
## Using the wakefield package to easily generate reproducible "CPP" sample data
set.seed(999)
df.CPP <-
  r_data_frame(
    n = 12500,
    r_sample(n = 255, x = "CPP", prob = NULL, name = "Type"),
    id,
    r_sample(x = c("Male", "Female"),
             prob = c(0.52, 0.48),
             name = "Sex"),
    r_sample(x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
             prob = c(0.12, 0.08, 0.08, 0.08, 0.08, 0.06, 0.06, 0.06, 0.06, 0.06, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.01, 0.01),
             name = "Age")
  )
## Using the wakefield package to easily generate reproducible "CIN" sample data
set.seed(999)
df.CIN <-
  r_data_frame(
    n = 97000,
    r_sample(n = 255, x = "CIN", prob = NULL, name = "Type"),
    id,
    r_sample(x = c("Male", "Female"),
             prob = c(0.55, 0.45),
             name = "Sex"),
    r_sample(x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
             prob = c(0.05, 0.04, 0.04, 0.04, 0.04, 0.044, 0.044, 0.044, 0.044, 0.044, 0.053, 0.053, 0.053, 0.053, 0.053, 0.053, 0.115, 0.115),
             name = "Age")
  )
## Using the wakefield package to easily generate reproducible "CLA" sample data
set.seed(999)
df.CLA <-
  r_data_frame(
    n = 16200,
    r_sample(n = 255, x = "CLA", prob = NULL, name = "Type"),
    id,
    r_sample(x = c("Male", "Female"),
             prob = c(0.57, 0.43),
             name = "Sex"),
    r_sample(x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
             prob = c(0.05, 0.035, 0.035, 0.035, 0.035, 0.038, 0.038, 0.038, 0.038, 0.038, 0.065, 0.065, 0.065, 0.065, 0.065, 0.065, 0.115, 0.115),
             name = "Age")
  )
# Merging the dataframes
mydata <- rbind(df.CIN, df.CLA, df.CPP)
head(mydata)
## # A tibble: 6 × 4
## Type ID Sex Age
## <chr> <chr> <chr> <chr>
## 1 CIN 00001 Female 16
## 2 CIN 00002 Male 10
## 3 CIN 00003 Male 2
## 4 CIN 00004 Male 14
## 5 CIN 00005 Female 14
## 6 CIN 00006 Female 16
Step 3: Exploratory data analysis
This set of data contains four variables.
The dependent variable is Type, which details the Children’s Services Outcome for each child. Using multinomial modelling we will explore how this outcome varies by sex.
The next step is to start by requesting the frequencies of these two variables to check how they are coded.
frq(mydata, Type, Sex)
## Type <character>
## # total N=125700 valid N=125700 mean=1.33 sd=0.65
##
## Value | N | Raw % | Valid % | Cum. %
## ----------------------------------------
## CIN | 97000 | 77.17 | 77.17 | 77.17
## CLA | 16200 | 12.89 | 12.89 | 90.06
## CPP | 12500 | 9.94 | 9.94 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
##
## Sex <character>
## # total N=125700 valid N=125700 mean=1.55 sd=0.50
##
## Value | N | Raw % | Valid % | Cum. %
## -----------------------------------------
## Female | 56513 | 44.96 | 44.96 | 44.96
## Male | 69187 | 55.04 | 55.04 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
The most common outcome was CIN (77% of all children), followed by CLA (13%) and CPP (10%). We can also see that there were 69,187 males and 56,513 females.
It is useful to see if there are any differences in Type by Sex with a cross-tabulation. Here we can see that CLA outcomes are slightly more common for males, and CPP outcomes slightly more common for females.
sjt.xtab(mydata$Type, mydata$Sex, show.col.prc = TRUE)
| Type | Sex: Female | Sex: Male | Total |
|---|---|---|---|
| CIN | 43647 (77.2 %) | 53353 (77.1 %) | 97000 (77.2 %) |
| CLA | 6931 (12.3 %) | 9269 (13.4 %) | 16200 (12.9 %) |
| CPP | 5935 (10.5 %) | 6565 (9.5 %) | 12500 (9.9 %) |
| Total | 56513 (100 %) | 69187 (100 %) | 125700 (100 %) |

χ2=63.131 · df=2 · Cramer’s V=0.022 · p=0.000
Step 4: Run a multinomial logistic regression
To run this regression the first step is to create a new factor variable, Type.f, which takes the values of Type and stores them as labelled factor levels (CIN, CLA and CPP).
Check that the coding of the variable has worked.
mydata$Type.f <- factor(mydata$Type, labels = c("CIN", "CLA", "CPP"))
head(mydata, n = 10)
## # A tibble: 10 × 5
## Type ID Sex Age Type.f
## <chr> <chr> <chr> <chr> <fct>
## 1 CIN 00001 Female 16 CIN
## 2 CIN 00002 Male 10 CIN
## 3 CIN 00003 Male 2 CIN
## 4 CIN 00004 Male 14 CIN
## 5 CIN 00005 Female 14 CIN
## 6 CIN 00006 Female 16 CIN
## 7 CIN 00007 Male 16 CIN
## 8 CIN 00008 Female 17 CIN
## 9 CIN 00009 Male 7 CIN
## 10 CIN 00010 Female 2 CIN
The next step of the multinomial logistic regression is to select a reference category. By default R orders the factor levels alphabetically, so it would treat CIN as the reference group. However, it is possible to change this using the relevel command. We will choose CLA as the reference group.
mydata$Type.f <- relevel(mydata$Type.f, ref = "CLA")
It is useful to create a new dummy variable, Male, which is coded 0 if the child is female and 1 if male. The purpose of this is to make the gender effect clearer to see in the model.
mydata$Male <- rec(mydata$Sex, rec = "Male = 1; Female = 0; else = NA")
Finally, the multinom command is used to run a multinomial model.
The summary() command can be used once the model has converged to view the results.
model1 <- multinom(Type.f ~ Male, data = mydata)
## # weights: 9 (4 variable)
## initial value 138095.564686
## iter 10 value 87153.669778
## iter 10 value 87153.669764
## final value 87153.669764
## converged
summary(model1)
## Call:
## multinom(formula = Type.f ~ Male, data = mydata)
##
## Coefficients:
## (Intercept) Male1
## CIN 1.8401323 -0.08990659
## CPP -0.1551774 -0.18968963
##
## Std. Errors:
## (Intercept) Male1
## CIN 0.01293019 0.01714112
## CPP 0.01768551 0.02393687
##
## Residual Deviance: 174307.3
## AIC: 174315.3
Step 5: Model results
The multinomial regression model results are included in the table Coefficients:, with the associated standard errors in the table Std. Errors:. The first column, (Intercept), contains the estimates for the reference category (female), whilst the Male1 column gives the effect of being male compared to the reference. Model results are given in logits, so we need to take the exponential to read them as odds and odds ratios. This can be achieved using the exp(coef()) command.
exp(coef(model1))
## (Intercept) Male1
## CIN 6.2973712 0.9140166
## CPP 0.8562633 0.8272158
To calculate the 95% confidence intervals (CI) and restructure the results to make them easier to read, use the tab_model function.
tab_model(model1)
Dependent variable: Type.f

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 6.30 | 6.14 – 6.46 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| (Intercept) | 0.86 | 0.83 – 0.89 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | | |
Here we can see that, for females, the odds of becoming CIN rather than CLA are 6.3. Compared to females, males have 0.91 times the odds of becoming CIN rather than CLA.
Similarly, the odds of becoming subject to a Child Protection Plan rather than CLA for a male are 0.83 times those of a female.
Underneath the table we can see the approximate R2 which is 0.000. We would therefore want to consider other potential predictors of Children’s Services Outcomes.
Step 6: Requesting predicted probabilities
Sometimes it is desirable to request the predicted probabilities to better understand the chances of receiving different Children’s Services Outcomes. This can be achieved by using the fitted() command.
predprob <- fitted(model1)
head(predprob)
## CLA CIN CPP
## 1 0.1226447 0.7723392 0.1050161
## 2 0.1339725 0.7711328 0.0948947
## 3 0.1339725 0.7711328 0.0948947
## 4 0.1339725 0.7711328 0.0948947
## 5 0.1226447 0.7723392 0.1050161
## 6 0.1226447 0.7723392 0.1050161
The first row of data relates to a female. The table illustrates that the probability of a female becoming CIN is 0.77, CPP is 0.11 and CLA is 0.12. The second row refers to a male, with probabilities of CIN = 0.77, CPP = 0.09 and CLA = 0.13 (all rounded to 2 d.p.).
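As a quick cross-check (a sketch using the predprob object created above), the first-row (female) probabilities reproduce the exponentiated intercepts reported in Step 5:

## Ratios of fitted probabilities for a female match the exponentiated intercepts
predprob[1, "CIN"] / predprob[1, "CLA"] # approximately 6.30
predprob[1, "CPP"] / predprob[1, "CLA"] # approximately 0.86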
Step 7: Adding a continuous predictor
It is possible to extend the model by adding in the age of the child/young person. This variable needs to be numeric. For the purposes of this model the reference category remains CLA.
mydata$Age <- as.numeric(mydata$Age)
model2 <- multinom(Type.f ~ Male + Age, data = mydata)
## # weights: 12 (6 variable)
## initial value 138095.564686
## iter 10 value 86003.438226
## iter 20 value 84340.937713
## final value 84340.823598
## converged
summary(model2)
## Call:
## multinom(formula = Type.f ~ Male + Age, data = mydata)
##
## Coefficients:
## (Intercept) Male1 Age
## CIN 2.010631 -0.08921876 -0.01681853
## CPP 1.039220 -0.18518794 -0.14516870
##
## Std. Errors:
## (Intercept) Male1 Age
## CIN 0.02088667 0.01714897 0.001589901
## CPP 0.02602449 0.02437257 0.002306540
##
## Residual Deviance: 168681.6
## AIC: 168693.6
tab_model(model2)
Dependent variable: Type.f

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 7.47 | 7.17 – 7.78 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| Age | 0.98 | 0.98 – 0.99 | <0.001 | CIN |
| (Intercept) | 2.83 | 2.69 – 2.97 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Age | 0.86 | 0.86 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.033 / 0.033 | | | |
The results in the table illustrate that the odds of becoming CIN rather than CLA tend to be lower as age increases. For every additional year of age, the odds fall by approximately 2% (1 - 0.98 = 0.02). This means that older children are less likely to become CIN rather than CLA.
The same is true when we consider the odds of CPP rather than CLA: here the odds fall by approximately 14% for every additional year of age (1 - 0.86 = 0.14).
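To get a feel for the size of this effect, a small worked sketch using the model2 coefficients (the row and column names are as they appear in the summary above) shows how the age effect compounds over several years:

## Cumulative effect of age on the odds of CPP rather than CLA
or_age_cpp <- exp(coef(model2)["CPP", "Age"]) # approximately 0.86 per year of age
or_age_cpp^5 # approximately 0.48: the odds roughly halve over five years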
Underneath the table we can see the approximate R2 which is 0.03. We would therefore want to consider other potential predictors of Children’s Services Outcomes.
Step 8: Exploring changes in predicted probability associated with gender whilst holding age constant
Finally, if we want to explore the changes in predicted probability associated with Sex whilst holding age constant, we can create another small dataset varying sex but holding Age at the sample mean.
sub.Type <- data.frame(Male = c(0,1), Age = mean(mydata$Age))
sub.Type$Male <- as.factor(sub.Type$Male)
p <- predict(model2, newdata = sub.Type, "probs")
barplot(p, ylab = "Probability", ylim=c(0,1), xlab = "Children's Service Outcome", beside = T, legend = TRUE)
For a child of average age, the predicted probability of becoming CIN is 0.79 for a female and 0.79 for a male, so sex makes very little difference to this outcome.
The probability of a female (of average age) becoming CPP is approximately 0.09 compared to 0.08 for a male.
The probability of a male (of average age) becoming CLA is approximately 0.14 compared to 0.12 for a female.
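The same approach can be extended to other profiles. As a sketch (the age.grid and p.age object names are made up for illustration), the code below varies Age across childhood whilst holding Sex constant at male, and plots the predicted probability of each outcome:

## Predicted probabilities across ages 0-17, holding Sex constant at male
age.grid <- data.frame(Male = factor(1, levels = c(0, 1)), Age = 0:17)
p.age <- predict(model2, newdata = age.grid, "probs")
matplot(0:17, p.age, type = "l", lty = 1, lwd = 2,
        xlab = "Age", ylab = "Probability")
legend("topleft", legend = colnames(p.age), lty = 1, lwd = 2, col = 1:3)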