Overview
Multinomial logistic regression is used when the dependent variable has more than two categories. It is an extension of the basic (binary) logistic regression model. For example, if the dependent variable has three categories, A, B and C, and A is chosen as the reference category, we would estimate regression models of B vs A and C vs A. It is important to note that these models are estimated simultaneously.
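In other words, for a three-category outcome with A as the reference category, the model estimates two logit equations at once. A sketch of the general form, with a single predictor x, is:

$$\log\frac{P(Y=B)}{P(Y=A)} = \beta_{0B} + \beta_{1B}x \qquad\qquad \log\frac{P(Y=C)}{P(Y=A)} = \beta_{0C} + \beta_{1C}x$$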
This model is used to predict the probabilities of a categorical dependent variable that has more than two outcome classes. The dependent variable is nominal in nature, meaning that there is no ordering of the target classes, i.e. the classes cannot be meaningfully ranked (e.g. 1, 2, 3, 4). Examples of multinomial logistic regression topics include:
Favorite ice cream flavor
Current employment status
Sentencing outcome status
Assumptions:
Multinomial logistic regression assumes that:
1 Each independent variable has a single value for each case.
2 The dependent variable cannot be perfectly predicted from the independent variables for each case.
3 No multicollinearity between independent variables - this occurs when two or more independent variables are highly correlated with each other, which makes it difficult to understand how much each independent variable contributes to the category of the dependent variable (see the sketch after this list).
4 There should be no outliers in the data points.
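As a minimal sketch of how assumption 3 might be checked, the code below computes pairwise correlations for a hypothetical data frame of numeric predictors (the predictor names here are made up for illustration only):

## Hypothetical numeric predictors - correlations close to +1 or -1 would signal multicollinearity
set.seed(1)
preds <- data.frame(age = rnorm(100, mean = 9, sd = 4),
                    n_referrals = rnorm(100, mean = 2, sd = 1))
cor(preds)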
Step 1: Installing the packages
The following packages are required: sjmisc and sjPlot. To create the multinomial models the nnet package needs to be installed. The wakefield package is also used below to generate the sample data.
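If any of these packages are not already available, they can be installed first (a one-off step, assuming installation from CRAN):

## Install the required packages (only needed once)
install.packages(c("sjmisc", "sjPlot", "nnet", "wakefield"))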
library(sjmisc)
library(sjPlot)
library(nnet)
library(wakefield)
Step 2: Creating a data frame
For the purposes of this exercise the wakefield package will be used to generate a random data set. This will be a pseudo data set resembling the number of children in a local authority who are either: a Child in Need (CIN), subject to a Child Protection Plan (CPP), or a Child Looked After (CLA).
## Using the wakefield package to easily generate reproducible "CPP" sample data
set.seed(999)
df.CPP <-
  r_data_frame(
    n = 12500,
    r_sample(n = 255, x = "CPP", prob = NULL, name = "Type"),
    id,
    r_sample(x = c("Male", "Female"),
             prob = c(0.52, 0.48),
             name = "Sex"),
    r_sample(x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
             prob = c(0.12, 0.08, 0.08, 0.08, 0.08, 0.06, 0.06, 0.06, 0.06, 0.06, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.01, 0.01),
             name = "Age")
  )
## Using the wakefield package to easily generate reproducible "CIN" sample data
set.seed(999)
df.CIN <-
  r_data_frame(
    n = 97000,
    r_sample(n = 255, x = "CIN", prob = NULL, name = "Type"),
    id,
    r_sample(x = c("Male", "Female"),
             prob = c(0.55, 0.45),
             name = "Sex"),
    r_sample(x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
             prob = c(0.05, 0.04, 0.04, 0.04, 0.04, 0.044, 0.044, 0.044, 0.044, 0.044, 0.053, 0.053, 0.053, 0.053, 0.053, 0.053, 0.115, 0.115),
             name = "Age")
  )
## Using the wakefield package to easily generate reproducible "CLA" sample data
set.seed(999)
df.CLA <-
  r_data_frame(
    n = 16200,
    r_sample(n = 255, x = "CLA", prob = NULL, name = "Type"),
    id,
    r_sample(x = c("Male", "Female"),
             prob = c(0.57, 0.43),
             name = "Sex"),
    r_sample(x = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17"),
             prob = c(0.05, 0.035, 0.035, 0.035, 0.035, 0.038, 0.038, 0.038, 0.038, 0.038, 0.065, 0.065, 0.065, 0.065, 0.065, 0.065, 0.115, 0.115),
             name = "Age")
  )
# Merging the dataframes
mydata <- rbind(df.CIN, df.CLA, df.CPP)
head(mydata)
## # A tibble: 6 × 4
## Type ID Sex Age
## <chr> <chr> <chr> <chr>
## 1 CIN 00001 Female 16
## 2 CIN 00002 Male 10
## 3 CIN 00003 Male 2
## 4 CIN 00004 Male 14
## 5 CIN 00005 Female 14
## 6 CIN 00006 Female 16
Step 3: Exploratory data analysis
This set of data contains four variables.
The dependent variable is Type, which details the Children’s Services Outcome for each child. Using multinomial modelling we will explore how this outcome varies by sex.
The next step is to start by requesting the frequencies of these two variables to check how they are coded.
frq(mydata, Type, Sex)
## Type <character>
## # total N=125700 valid N=125700 mean=1.33 sd=0.65
##
## Value | N | Raw % | Valid % | Cum. %
## ----------------------------------------
## CIN | 97000 | 77.17 | 77.17 | 77.17
## CLA | 16200 | 12.89 | 12.89 | 90.06
## CPP | 12500 | 9.94 | 9.94 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
##
## Sex <character>
## # total N=125700 valid N=125700 mean=1.55 sd=0.50
##
## Value | N | Raw % | Valid % | Cum. %
## -----------------------------------------
## Female | 56513 | 44.96 | 44.96 | 44.96
## Male | 69187 | 55.04 | 55.04 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
The most common outcome was CIN (77% of all children), followed by CLA (13%) and CPP (10%). We can also see that there were 69,187 males and 56,513 females.
It is useful to see if there are any differences in Type by Sex with a cross-tabulation. Here we can see that CLA outcomes are slightly more common for males, and CPP outcomes slightly more common for females.
sjt.xtab(mydata$Type, mydata$Sex, show.col.prc = TRUE)
| Type | Sex: Female | Sex: Male | Total |
|---|---|---|---|
| CIN | 43647 (77.2 %) | 53353 (77.1 %) | 97000 (77.2 %) |
| CLA | 6931 (12.3 %) | 9269 (13.4 %) | 16200 (12.9 %) |
| CPP | 5935 (10.5 %) | 6565 (9.5 %) | 12500 (9.9 %) |
| Total | 56513 (100 %) | 69187 (100 %) | 125700 (100 %) |

χ2=63.131 · df=2 · Cramer’s V=0.022 · p=0.000
Step 4: Run a multinomial logistic regression
To run this regression the first step is to create a new factor variable, Type.f, which takes the values of Type and stores them as labelled factor levels (CIN, CLA and CPP).
Check that the coding of the variable has worked.
mydata$Type.f <- factor(mydata$Type, labels = c("CIN", "CLA", "CPP"))
head(mydata, n = 10)
## # A tibble: 10 × 5
## Type ID Sex Age Type.f
## <chr> <chr> <chr> <chr> <fct>
## 1 CIN 00001 Female 16 CIN
## 2 CIN 00002 Male 10 CIN
## 3 CIN 00003 Male 2 CIN
## 4 CIN 00004 Male 14 CIN
## 5 CIN 00005 Female 14 CIN
## 6 CIN 00006 Female 16 CIN
## 7 CIN 00007 Male 16 CIN
## 8 CIN 00008 Female 17 CIN
## 9 CIN 00009 Male 7 CIN
## 10 CIN 00010 Female 2 CIN
The next step of the multinomial logistic regression is to select a reference category. By default R orders the factor levels alphabetically, so it would treat CIN as the reference group. However, it is possible to change this using the relevel command. We will choose CLA as the reference group.
mydata$Type.f <- relevel(mydata$Type.f, ref = "CLA")
It is useful to create a new dummy variable, Male, which is coded 0 if the child is female and 1 if male. The purpose of this is to make the gender effect clearer to see in the model.
mydata$Male <- rec(mydata$Sex, rec = "Male = 1; Female = 0; else = NA")
Finally, the multinom command is used to run a multinomial model.
The summary() command can be used once the model has converged to view the results.
model1 <- multinom(Type.f ~ Male, data = mydata)
## # weights: 9 (4 variable)
## initial value 138095.564686
## iter 10 value 87153.669778
## iter 10 value 87153.669764
## final value 87153.669764
## converged
summary(model1)
## Call:
## multinom(formula = Type.f ~ Male, data = mydata)
##
## Coefficients:
## (Intercept) Male1
## CIN 1.8401323 -0.08990659
## CPP -0.1551774 -0.18968963
##
## Std. Errors:
## (Intercept) Male1
## CIN 0.01293019 0.01714112
## CPP 0.01768551 0.02393687
##
## Residual Deviance: 174307.3
## AIC: 174315.3
Step 5: Model results
The multinomial regression model results are included in the table Coefficients:, with the associated standard errors in the table Std. Errors:. The first column, (Intercept), contains the estimates for the reference category (female), whilst the Male1 column gives the effect of being male compared to the reference. Model results are given in logits, so we need to take the exponential to read them as odds and odds ratios. This can be achieved using the exp(coef()) command.
exp(coef(model1))
## (Intercept) Male1
## CIN 6.2973712 0.9140166
## CPP 0.8562633 0.8272158
To calculate the 95% confidence intervals (CI) and restructure the results to make them easier to read, use the tab_model function.
tab_model(model1)
Dependent variable: Type.f

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 6.30 | 6.14 – 6.46 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| (Intercept) | 0.86 | 0.83 – 0.89 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | | |
Here we can see that, for females, the odds of becoming CIN rather than CLA are 6.3. Compared to females, males have 0.91 times the odds of becoming CIN rather than CLA.
Similarly, the odds of becoming subject to a Child Protection Plan rather than CLA for a male are 0.83 times those of a female.
Underneath the table we can see the approximate R2 which is 0.000. We would therefore want to consider other potential predictors of Children’s Services Outcomes.
Step 6: Requesting predicted probabilities
Sometimes it is desirable to request the predicted probabilities to better understand the chances of receiving different Children’s Services Outcomes. This can be achieved by using the fitted() command.
predprob <- fitted(model1)
head(predprob)
## CLA CIN CPP
## 1 0.1226447 0.7723392 0.1050161
## 2 0.1339725 0.7711328 0.0948947
## 3 0.1339725 0.7711328 0.0948947
## 4 0.1339725 0.7711328 0.0948947
## 5 0.1226447 0.7723392 0.1050161
## 6 0.1226447 0.7723392 0.1050161
The first row of data relates to a female. The table illustrates that the probability of a female becoming CIN is 0.77, CPP is 0.11 and CLA is 0.12. The second row refers to a male, with probabilities of CIN = 0.77, CPP = 0.09 and CLA = 0.13 (all rounded to 2 d.p.).
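As a quick cross-check (a sketch using the predprob object created above), the first-row (female) probabilities reproduce the exponentiated intercepts reported in Step 5:

## Ratios of fitted probabilities for a female match the exponentiated intercepts
predprob[1, "CIN"] / predprob[1, "CLA"] # approximately 6.30
predprob[1, "CPP"] / predprob[1, "CLA"] # approximately 0.86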
Step 7: Adding a continuous predictor
It is possible to extend the model by adding in the age of the child/young person. This variable needs to be numeric. For the purposes of this model the reference category remains CLA.
mydata$Age <- as.numeric(mydata$Age)
model2 <- multinom(Type.f ~ Male + Age, data = mydata)
## # weights: 12 (6 variable)
## initial value 138095.564686
## iter 10 value 86003.438226
## iter 20 value 84340.937713
## final value 84340.823598
## converged
summary(model2)
## Call:
## multinom(formula = Type.f ~ Male + Age, data = mydata)
##
## Coefficients:
## (Intercept) Male1 Age
## CIN 2.010631 -0.08921876 -0.01681853
## CPP 1.039220 -0.18518794 -0.14516870
##
## Std. Errors:
## (Intercept) Male1 Age
## CIN 0.02088667 0.01714897 0.001589901
## CPP 0.02602449 0.02437257 0.002306540
##
## Residual Deviance: 168681.6
## AIC: 168693.6
tab_model(model2)
Dependent variable: Type.f

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 7.47 | 7.17 – 7.78 | <0.001 | CIN |
| Male [1] | 0.91 | 0.88 – 0.95 | <0.001 | CIN |
| Age | 0.98 | 0.98 – 0.99 | <0.001 | CIN |
| (Intercept) | 2.83 | 2.69 – 2.97 | <0.001 | CPP |
| Male [1] | 0.83 | 0.79 – 0.87 | <0.001 | CPP |
| Age | 0.86 | 0.86 – 0.87 | <0.001 | CPP |
| Observations | 125700 | | | |
| R2 / R2 adjusted | 0.033 / 0.033 | | | |
The results in the table illustrate that the odds of becoming CIN rather than CLA tend to be lower as age increases. For every additional year of age, the odds fall by approximately 2% (1 - 0.98 = 0.02). This means that older children are less likely to become CIN rather than CLA.
The same is true when we consider the odds of CPP rather than CLA: here the odds fall by approximately 14% for every additional year of age (1 - 0.86 = 0.14).
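To get a feel for the size of this effect, a small worked sketch using the model2 coefficients (the row and column names are as they appear in the summary above) shows how the age effect compounds over several years:

## Cumulative effect of age on the odds of CPP rather than CLA
or_age_cpp <- exp(coef(model2)["CPP", "Age"]) # approximately 0.86 per year of age
or_age_cpp^5 # approximately 0.48: the odds roughly halve over five years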
Underneath the table we can see the approximate R2 which is 0.03. We would therefore want to consider other potential predictors of Children’s Services Outcomes.
Step 8: Exploring changes in predicted probability associated with gender whilst holding age constant
Finally, if we want to explore the changes in predicted probability associated with Sex whilst holding age constant, we can create another small dataset varying sex but holding Age at the sample mean.
sub.Type <- data.frame(Male = c(0,1), Age = mean(mydata$Age))
sub.Type$Male <- as.factor(sub.Type$Male)
p <- predict(model2, newdata = sub.Type, "probs")
barplot(p, ylab = "Probability", ylim=c(0,1), xlab = "Children's Service Outcome", beside = T, legend = TRUE)
For a child of average age, the predicted probability of becoming CIN is 0.79 for a female and 0.79 for a male, so sex makes very little difference to this outcome.
The probability of a female (of average age) becoming CPP is approximately 0.09 compared to 0.08 for a male.
The probability of a male (of average age) becoming CLA is approximately 0.14 compared to 0.12 for a female.
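The same approach can be extended to other profiles. As a sketch (the age.grid and p.age object names are made up for illustration), the code below varies Age across childhood whilst holding Sex constant at male, and plots the predicted probability of each outcome:

## Predicted probabilities across ages 0-17, holding Sex constant at male
age.grid <- data.frame(Male = factor(1, levels = c(0, 1)), Age = 0:17)
p.age <- predict(model2, newdata = age.grid, "probs")
matplot(0:17, p.age, type = "l", lty = 1, lwd = 2,
        xlab = "Age", ylab = "Probability")
legend("topleft", legend = colnames(p.age), lty = 1, lwd = 2, col = 1:3)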