Cornelia Parker - TBlack Path (Bunhill Fields), 2013. Detail - Manchester University

Overview

Multinomial logistic regression is used in situations where the dependent variable has more than two categories. It is an extension of the basic logistic regression model. For example, if the dependent variable has three categories, A, B and C we would estimate regression models of A vs B, and A vs C. It is important to note that these models are estimated simultaneously.

This model is used to predict the probabilities of categorically dependent variable, which has more than two outcome classes. The dependent variables are nominal in nature. This means that there is no kind of ordering in the target classes ie these classes cannot be meaningfully ordered (eg 1,2,3,4 etc). Examples of multinomial logistic regression topics include:

Favorite ice cream flavor

Current employment status

Sentencing outcome status

Assumptions:

Multinomial logistic regression assumes that:

1 Each independent variable has a single value for each case.

2 The dependent variable cannot be perfectly predicted from the independent variables for each case.

3 No multicollinearity between independent variables - This occurs when two or more independent variables are highly correlated with each other. This makes it difficult to understand how much every independent variable contributes to the category of dependent variable.

4 There should be no outliers in the data points.

Step 1: Installing the packages

The following packagaes are required sjmisc and sjPlot. To create the multinomial models the nnet package needs to be installed.

library(sjmisc)
library(sjPlot)
library(nnet)
library(wakefield)

Step 2: Creating a data frame

For the purposes of this exercise the wakefield package will be used to generate a random data set. This will be a psuedo data set to resemble the number of children in local authority who are either: a Child in Need (CIN), Subject to a Child Protection Plan (CP), or child looked after (CLA).

## Using the wakefield package to easily generate reproducible "CPP" sample data
set.seed(999)
df.CPP <- 
  r_data_frame(
    n = 12500,
    r_sample(n = 255, x = "CPP", prob = NULL, name = "Type"),
    id, 
    r_sample( x = c("Male", "Female"),
              prob = c(0.52, 0.48),
              name = "Sex"),
        r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
              prob = c(0.12, 0.08, 0.08, 0.08, 0.08, 0.06, 0.06, 0.06, 0.06, 0.06, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.01, 0.01),
              name = "Age")
  )

## Using the wakefield package to easily generate reproducible "CIN" sample data
set.seed(999)
df.CIN <- 
  r_data_frame(
    n = 97000,
    r_sample(n = 255, x = "CIN", prob = NULL, name = "Type"),
    id, 
    r_sample( x = c("Male", "Female"),
              prob = c(0.55, 0.45),
              name = "Sex"),
    r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
              prob = c(0.05, 0.04, 0.04, 0.04, 0.04, 0.044, 0.044, 0.044, 0.044, 0.044, 0.053, 0.053, 0.053, 0.053, 0.053, 0.053, 0.115, 0.115),
              name = "Age")
  )

## Using the wakefield package to easily generate reproducible "CLA" sample data
set.seed(999)
df.CLA <- 
  r_data_frame(
    n = 16200,
    r_sample(n = 255, x = "CLA", prob = NULL, name = "Type"),
    id, 
    r_sample( x = c("Male", "Female"),
              prob = c(0.57, 0.43),
              name = "Sex"),
    r_sample( x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
              prob = c(0.05, 0.035, 0.035, 0.035, 0.035, 0.038, 0.038, 0.038, 0.038, 0.038, 0.065, 0.065, 0.065, 0.065, 0.065, 0.065, 0.115, 0.115),
              name = "Age")
  )

# Merging the dataframes
mydata <- rbind(df.CIN, df.CLA, df.CPP)
head(mydata)
## # A tibble: 6 × 4
##   Type  ID    Sex    Age  
##   <chr> <chr> <chr>  <chr>
## 1 CIN   00001 Female 16   
## 2 CIN   00002 Male   10   
## 3 CIN   00003 Male   2    
## 4 CIN   00004 Male   14   
## 5 CIN   00005 Female 14   
## 6 CIN   00006 Female 16

Step 3: Exploratory data analysis

This set of data contains four variables.

The dependent varaible is type which details the Children’s Services Outcome for each child. Using multinomial modeling we will explore how this outcome varies by sex.

The next step is to start be requesting the frequencies of these two variables to check how they are coded.

frq(mydata, Type, Sex)
## Type <character> 
## # total N=125700 valid N=125700 mean=1.33 sd=0.65
## 
## Value |     N | Raw % | Valid % | Cum. %
## ----------------------------------------
## CIN   | 97000 | 77.17 |   77.17 |  77.17
## CLA   | 16200 | 12.89 |   12.89 |  90.06
## CPP   | 12500 |  9.94 |    9.94 | 100.00
## <NA>  |     0 |  0.00 |    <NA> |   <NA>
## 
## Sex <character> 
## # total N=125700 valid N=125700 mean=1.55 sd=0.50
## 
## Value  |     N | Raw % | Valid % | Cum. %
## -----------------------------------------
## Female | 56513 | 44.96 |   44.96 |  44.96
## Male   | 69187 | 55.04 |   55.04 | 100.00
## <NA>   |     0 |  0.00 |    <NA> |   <NA>

The most common outcome was CIN (77% of all children), followed by CPP( 13%). We can also see that there were 69,014 males and 56,686 females.

It is useful to see if there are any differences in Type by Sex with a cross-tabulation. Here we can see that CLA Children’s Services Outcomes are much more likely for males, and CPP are more common for females.

sjt.xtab(mydata$Type, mydata$Sex, show.col.prc = TRUE)
Type Sex Total
Female Male
CIN 43647
77.2 %
53353
77.1 %
97000
77.2 %
CLA 6931
12.3 %
9269
13.4 %
16200
12.9 %
CPP 5935
10.5 %
6565
9.5 %
12500
9.9 %
Total 56513
100 %
69187
100 %
125700
100 %
χ2=63.131 · df=2 · Cramer’s V=0.022 · p=0.000

Step 4: Run a multinomial logistic regression

To run this regression the first step is to create a new “factor” variable, Type.f that replicates the values of Type but which replaces the values with text labels.

Check that the coding of the variable has worked.

mydata$Type.f <- factor(mydata$Type, labels = c("CIN", "CLA", "CPP"))
head(mydata, n = 10)
## # A tibble: 10 × 5
##    Type  ID    Sex    Age   Type.f
##    <chr> <chr> <chr>  <chr> <fct> 
##  1 CIN   00001 Female 16    CIN   
##  2 CIN   00002 Male   10    CIN   
##  3 CIN   00003 Male   2     CIN   
##  4 CIN   00004 Male   14    CIN   
##  5 CIN   00005 Female 14    CIN   
##  6 CIN   00006 Female 16    CIN   
##  7 CIN   00007 Male   16    CIN   
##  8 CIN   00008 Female 17    CIN   
##  9 CIN   00009 Male   7     CIN   
## 10 CIN   00010 Female 2     CIN

The next step of the multinomial logistic regression is to select a reference category. By default R orders the factors alphabetically, so would treat CIN as the reference group. However, it is possible to change this using the relevel command. We will choose CLA as the reference group.

mydata$Type.f <- relevel(mydata$Type.f, ref = "CLA")

It is useful to create a new dummy variable, male which has a 0 if female and 1 if male. The purpose of this is to make the gender effect clearer to see in the model.

mydata$Male <- rec(mydata$Sex, rec = "Male = 1; Female = 0; else = NA")

Finally, the multinom command is used to run a multinomial model.

The summary() command can be used once the model has converged to view the results.

model1 <-multinom(Type.f ~ Male, data = mydata)
## # weights:  9 (4 variable)
## initial  value 138095.564686 
## iter  10 value 87153.669778
## iter  10 value 87153.669764
## final  value 87153.669764 
## converged
summary(model1)
## Call:
## multinom(formula = Type.f ~ Male, data = mydata)
## 
## Coefficients:
##     (Intercept)       Male1
## CIN   1.8401323 -0.08990659
## CPP  -0.1551774 -0.18968963
## 
## Std. Errors:
##     (Intercept)      Male1
## CIN  0.01293019 0.01714112
## CPP  0.01768551 0.02393687
## 
## Residual Deviance: 174307.3 
## AIC: 174315.3

Step 5: Model results

The multinomial regression model results are included in the table Coefficients: with the associated standard errors included in the table Std.Errors: . The first colum (INtercept) includes the estimates for the reference category (female), whilst column male returns the results for the effect of being male compared to the reference. Model results are given in logits, so we need to take the exponential to read these as odds and odds ratios. This can be achieved by using the exp(coef()) command.

exp(coef(model1))
##     (Intercept)     Male1
## CIN   6.2973712 0.9140166
## CPP   0.8562633 0.8272158

To calculate the 95% confidence interval (CI) and restructure the results to make it easier to read use the tab_model function.

tab_model(model1)
  Type.f
Predictors Odds Ratios CI p Response
(Intercept) 6.30 6.14 – 6.46 <0.001 CIN
Male [1] 0.91 0.88 – 0.95 <0.001 CIN
(Intercept) 0.86 0.83 – 0.89 <0.001 CPP
Male [1] 0.83 0.79 – 0.87 <0.001 CPP
Observations 125700
R2 / R2 adjusted 0.000 / 0.000

Here we can see that the odds of becoming CIN rather than CLA are 6.3 for females. Compared to females, males have 0.91 times less odds of becoming CLA rather than becoming CIN.

Similarly, the odds of becoming subject to Child Protection Plan rather than CLA are 0.83 times less for a male than a female.

Underneath the table we can see the approximate R2 which is 0.000. We would therefore want to consider other potential predictors of Children’s Services Outcomes.

Step 7: Requesting predicted probabilities

Sometimes it is desirable to request the predicted probabilities to better understand the chances of receiving different Children’s Services Outcomes. This can be achieved by using the fitted() command.

predprob <-fitted(model1)
head(predprob)
##         CLA       CIN       CPP
## 1 0.1226447 0.7723392 0.1050161
## 2 0.1339725 0.7711328 0.0948947
## 3 0.1339725 0.7711328 0.0948947
## 4 0.1339725 0.7711328 0.0948947
## 5 0.1226447 0.7723392 0.1050161
## 6 0.1226447 0.7723392 0.1050161

The first row of data relate to males. The table illustrates that the probability of a male becoming CIN is 0.77, CPP is 0.11 and CLA is 0.12. The second row refers to females, with probabilities of CIN = 0.13, CPP = 0.09 and CLA is 0.77. All rounded to 2.d.p

Step 8: Adding a continuous predictor

It is possible to extend the model by adding in the age of the child/young person. This variable is needs to be numeric. For the purposes of this model the reference category will be CLA.

mydata$Age <- as.numeric(mydata$Age)
model2 <-multinom(Type.f ~ Male + Age, data = mydata)
## # weights:  12 (6 variable)
## initial  value 138095.564686 
## iter  10 value 86003.438226
## iter  20 value 84340.937713
## final  value 84340.823598 
## converged
summary(model2)
## Call:
## multinom(formula = Type.f ~ Male + Age, data = mydata)
## 
## Coefficients:
##     (Intercept)       Male1         Age
## CIN    2.010631 -0.08921876 -0.01681853
## CPP    1.039220 -0.18518794 -0.14516870
## 
## Std. Errors:
##     (Intercept)      Male1         Age
## CIN  0.02088667 0.01714897 0.001589901
## CPP  0.02602449 0.02437257 0.002306540
## 
## Residual Deviance: 168681.6 
## AIC: 168693.6
tab_model(model2)
  Type.f
Predictors Odds Ratios CI p Response
(Intercept) 7.47 7.17 – 7.78 <0.001 CIN
Male [1] 0.91 0.88 – 0.95 <0.001 CIN
Age 0.98 0.98 – 0.99 <0.001 CIN
(Intercept) 2.83 2.69 – 2.97 <0.001 CPP
Male [1] 0.83 0.79 – 0.87 <0.001 CPP
Age 0.86 0.86 – 0.87 <0.001 CPP
Observations 125700
R2 / R2 adjusted 0.033 / 0.033

The results in the table illustrate that the odds of becoming CIN rather than CLA tend to be lower as the age increases. For every year, the odds fall approximately 2% (1-0.98 = 0.02). This means that older children are less likely to become CIN rather LAC.

The same is true when we consider the odds of CPP rather than LAC. The odds fall approximately 14%.

Underneath the table we can see the approximate R2 which is 0.03. We would therefore want to consider other potential predictors of Children’s Services Outcomes.

Step 8: Exploring changes in predicted probability associated with gender whilst holding age constant

Finally, is we want to explore the changes in predicted probability associated with Sex whilst holding age constant, we can create another dataset varying gender, but holding the Age at the sample mean.

sub.Type <- data.frame(Male = c(0,1), Age = mean(mydata$Age))
sub.Type$Male <- as.factor(sub.Type$Male)
p <- predict(model2, newdata = sub.Type, "probs")
barplot(p, ylab = "Probability", ylim=c(0,1), xlab = "Children's Service Outcome", beside = T, legend = TRUE)

For someone of average age, the predicted probability of becoming CIN for a woman is 0.79 compared to 0.79 for a man.

The probability of a male (of average age) becoming CPP is approximately 0.09 compared to 0.08 for a female.

The probability of a male (of average age) becoming CLA is approximately 0.12 compared to 0.14 for a female.