May the odds be in your favour : A quick look into Multinomial and Ordinal logistic Regression

Author
Affiliation

Bongani Ncube(3002164)

University Of the Witwatersrand (School of Public Health)

Published

June 12, 2025

Keywords

Logistic Regression, Statistical modeling, Sensitivity analysis, logit, Risk Ratios

Generalized Linear Modeling

GLM theory suggests that the canonical link can be modeled as a linear combination of the explanatory variable(s). This approach unifies a number of modeling results used in statistics.

By so doing we have now generalized our modeling to handle non-normal responses. In addition to normally distributed responses, we are able to handle Poisson responses, binomial responses, and more. Writing a pmf or pdf for a response in one-parameter exponential family form reveals the canonical link which can be modeled as a linear function of the predictors.

Exponential Family

The theory of GLMs is developed for data with distribution given y belongs to the exponential family.
The form of the data distribution that is useful for GLMs is

\[ f(y;\theta, \phi) = \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y, \phi)) \]

where

  • \(\theta\) is called the natural parameter
  • \(\phi\) is called the dispersion parameter

Example

if we have \(Y \sim N(\mu, \sigma^2)\)

\[ \begin{aligned} f(y; \mu, \sigma^2) &= \frac{1}{(2\pi \sigma^2)^{1/2}}\exp(-\frac{1}{2\sigma^2}(y- \mu)^2) \\ &= \exp(-\frac{1}{2\sigma^2}(y^2 - 2y \mu +\mu^2)- \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{y \mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y , \phi)) \end{aligned} \]

where

  • \(\theta = \mu\)
  • \(b(\theta) = \frac{\mu^2}{2}\)
  • \(a(\phi) = \sigma^2 = \phi\)
  • \(c(y , \phi) = - \frac{1}{2}(\frac{y^2}{\phi}+\log(2\pi \sigma^2))\)

Properties of GLM exponential families

  1. \(E(Y) = b' (\theta)\) where \(b'(\theta) = \frac{\partial b(\theta)}{\partial \theta}\)

  2. \(var(Y) = a(\phi)b''(\theta)= a(\phi)V(\mu)\).

    • \(V(\mu)\) is the variance function; however, it is only the variance in the case that \(a(\phi) =1\)
  3. If \(a(), b(), c()\) are identifiable, we will derive expected value and variance of Y.

Examples

Normal distribution

\[ \begin{aligned} b'(\theta) &= \frac{\partial b(\mu^2/2)}{\partial \mu} = \mu \\ V(\mu) &= \frac{\partial^2 (\mu^2/2)}{\partial \mu^2} = 1 \\ &\to var(Y) = a(\phi) = \sigma^2 \end{aligned} \]

Poisson distribution

\[ \begin{aligned} f(y, \theta, \phi) &= \frac{\mu^y \exp(-\mu)}{y!} \\ &= \exp(y\log(\mu) - \mu - \log(y!)) \\ &= \exp(y\theta - \exp(\theta) - \log(y!)) \end{aligned} \]

where

  • \(\theta = \log(\mu)\)
  • \(a(\phi) = 1\)
  • \(b(\theta) = \exp(\theta)\)
  • \(c(y, \phi) = \log(y!)\)

Hence,

\[ \begin{aligned} E(Y) &= \frac{\partial b(\theta)}{\partial \theta} = \exp(\theta) = \mu \\ var(Y) &= \frac{\partial^2 b(\theta)}{\partial \theta^2} = \mu \end{aligned} \]

Since \(\mu = E(Y) = b'(\theta)\)

In GLM, we take some monotone function (typically nonlinear) of \(\mu\) to be linear in the set of covariates

\[ g(\mu) = g(b'(\theta)) = \mathbf{x'\beta} \]

Equivalently,

\[ \mu = g^{-1}(\mathbf{x'\beta}) \]

where \(g(.)\) is the link function since it links mean response (\(\mu = E(Y)\)) and a linear expression of the covariates

Some people use \(\eta = \mathbf{x'\beta}\) where \(\eta\) = the “linear predictor”

GLM is composed of 2 components

The random component:

  • is the distribution chosen to model the response variables \(Y_1,...,Y_n\)

  • is specified by the choice fo \(a(), b(), c()\) in the exponential form

  • Notation:

    • Assume that there are n independent response variables \(Y_1,...,Y_n\) with densities
      \[ f(y_i ; \theta_i, \phi) = \exp(\frac{\theta_i y_i - b(\theta_i)}{a(\phi)}+ c(y_i, \phi)) \] notice each observation might have different densities
    • Assume that \(\phi\) is constant for all \(i = 1,...,n\), but \(\theta_i\) will vary. \(\mu_i = E(Y_i)\) for all i.

The systematic component

  • is the portion of the model that gives the relation between \(\mu\) and the covariates \(\mathbf{x}\)

  • consists of 2 parts:

    • the link function, \(g(.)\)
    • the linear predictor, \(\eta = \mathbf{x'\beta}\)
  • Notation:

    • assume \(g(\mu_i) = \mathbf{x'\beta} = \eta_i\) where \(\mathbf{\beta} = (\beta_1,..., \beta_p)'\)
    • The parameters to be estimated are \(\beta_1,...\beta_p , \phi\)

The Canonical Link

To choose \(g(.)\), we can use canonical link function (Remember: Canonical link is just a special case of the link function)

If the link function \(g(.)\) is such \(g(\mu_i) = \eta_i = \theta_i\), the natural parameter, then \(g(.)\) is the canonical link.

The inverse link

\(g^{-1}(.)\) is also known as the mean function, take linear predictor output (ranging from \(-\infty\) to \(\infty\)) and transform it into a different scale.

  • Exponential: converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and \(\infty\) (which you can see that is useful in case you want to convert a linear predictor into a non-negative value). \(\lambda = \exp(y) = \mathbf{\beta X}\)

  • Inverse Logit (also known as logistic): converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and 1, which is useful in case you want to convert a linear predictor to a probability. \(\theta = \frac{1}{1 + \exp(-y)} = \frac{1}{1 + \exp(- \mathbf{\beta X})}\)

    • \(y\) = linear predictor value
    • \(\theta\) = transformed value

The identity link is that

\[ \begin{aligned} \eta_i &= g(\mu_i) = \mu_i \\ \mu_i &= g^{-1}(\eta_i) = \eta_i\\ \end{aligned} \]

Examples

Normal random component

  • Mean Response: \(\mu_i = \theta_i\)

  • Canonical Link: \(g( \mu_i) = \mu_i\) (the identity link)

Binomial random component

  • Mean Response: \(\mu_i = \frac{n_i \exp( \theta)}{1+\exp (\theta_i)}\) and \(\theta(\mu_i) = \log(\frac{p_i }{1-p_i}) = \log (\frac{\mu_i} {n_i - \mu_i})\)

  • Canonical link: \(g(\mu_i) = \log(\frac{\mu_i} {n_i - \mu_i})\) (logit link)

Poisson random component

  • Mean Response: \(\mu_i = \exp(\theta_i)\)

  • Canonical Link: \(g(\mu_i) = \log(\mu_i)\)

Gamma random component:

  • Mean response: \(\mu_i = -\frac{1}{\theta_i}\) and \(\theta(\mu_i) = - \mu_i^{-1}\)

  • Canonical Link: \(g(\mu\_i) = - \frac{1}{\mu_i}\)

Inverse Gaussian random

  • Canonical Link: \(g(\mu_i) = \frac{1}{\mu_i^2}\)

Logistic Regression

If \(X\) follows a binomial distribution with \(n\) trials and probability of success \(p\), we can write:

\[\begin{align*} P(X=x)&= \binom{n}{x}p^x(1-p)^{(n-x)} \\ &=e^{x\log(p) + (n-x)\log(1-p) + \log\binom{n}{x}} \end{align*}\] However, this probability mass function is not quite in one-parameter exponential family form. Note that there are two terms in the exponent which consist of a product of functions of \(x\) and \(p\). So more simplification is in order:

\[\begin{equation*} P(X=x) = e^{x\log\left(\frac{p}{1-p}\right) + n\log(1-p)+ \log\binom{n}{x}} \end{equation*}\].

The one-parameter exponential family form for binomial responses shows that the canonical link is \(\log\left(\frac{p}{1-p}\right)\). Thus, GLM theory suggests that constructing a model using the logit, the log odds of a score, as a linear function of covariates is a reasonable approach.

\[ p_i = f(\mathbf{x}_i ; \beta) = \frac{exp(\mathbf{x_i'\beta})}{1 + exp(\mathbf{x_i'\beta})} \]

Equivalently,

\[ logit(p_i) = log(\frac{p_i}{1-p_i}) = \mathbf{x_i'\beta} \]

where \(\frac{p_i}{1-p_i}\)is the odds.

In this form, the model is specified such that a function of the mean response is linear. Hence, Generalized Linear Models

The likelihood function

\[L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i}\]

where \[p_i = \frac{\mathbf{exp(x'_i \beta})}{1+\mathbf{exp(x'_i \beta)}}\] and \(1-p_i = \frac{1}{1+ exp(\mathbf{x'_i \beta})}\)

Hence, our objective function is

\[ \begin{aligned} Q(\beta) &= log(L(p_i))\\ &= log\left(p_i^{\sum Y_i}(1-p_i)^{\sum(1-Y_i)}\right)\\ &= log\left(\frac{\mathbf{exp(x'_i \beta})}{1+\mathbf{exp(x'_i \beta)}})^{\sum Y_i}(\frac{1}{1+ exp(\mathbf{x'_i \beta})})^{\sum(1-Y_i)}\right)\\ &=\sum Y_ix'_i\beta - \sum Y_ilog(1+\exp(x'_i\beta))- \sum(1-Y_i)log(1+\exp(x'_i\beta))\\ &=\sum Y_ix'_i\beta - \sum Y_ilog(1+\exp(x'_i\beta))+\sum Y_ilog(1+\exp(x'_i\beta))- \sum log(1+\exp(x'_i\beta))\\ &= \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n log(1+ exp(\mathbf{x'_i \beta})) \end{aligned} \]

we could maximize this function numerically using the optimization method above, which allows us to find numerical MLE for \(\hat{\beta}\). Then we can use the standard asymptotic properties of MLEs to make inference.

Property of MLEs is that parameters are asymptotically unbiased with sample variance-covariance matrix given by the inverse Fisher information matrix

\[ \hat{\beta} \dot{\sim} AN(\beta,[\mathbf{I}(\beta)]^{-1}) \]

where the Fisher Information matrix, \(\mathbf{I}(\beta)\) is

\[ \begin{aligned} \mathbf{I}(\beta) &= E[\frac{\partial \log(L(\beta))}{\partial (\beta)}\frac{\partial \log(L(\beta))}{\partial \beta'}] \\ &= E[(\frac{\partial \log(L(\beta))}{\partial \beta_i} \frac{\partial \log(L(\beta))}{\partial \beta_j})_{ij}] \end{aligned} \]

Under regularity conditions, this is equivalent to the negative of the expected value of the Hessian Matrix

\[ \begin{aligned} \mathbf{I}(\beta) &= -E[\frac{\partial^2 \log(L(\beta))}{\partial \beta \partial \beta'}] \\ &= -E[(\frac{\partial^2 \log(L(\beta))}{\partial \beta_i \partial \beta_j})_{ij}] \end{aligned} \]

Example:

\[ x_i' \beta = \beta_0 + \beta_1 x_i \]

\[ \begin{aligned} - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_0} &= \sum_{i=1}^n \frac{\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_1} &= \sum_{i=1}^n \frac{x_i^2\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{x_i\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_i^2p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta_0 \partial \beta_1} &= \sum_{i=1}^n \frac{x_i\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - x_i[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_ip_i (1-p_i) \\ \end{aligned} \]

Hence,

\[ \mathbf{I} (\beta) = \left[ \begin{array} {cc} \sum_i p_i(1-p_i) & \sum_i x_i p_i(1-p_i) \\ \sum_i x_i p_i(1-p_i) & \sum_i x_i^2 p_i(1-p_i) \end{array} \right] \]

Inference

Likelihood Ratio Tests

To formulate the test, let \(\beta = [\beta_1', \beta_2']'\). If you are interested in testing a hypothesis about \(\beta_1\), then we leave \(\beta_2\) unspecified (called nuisance parameters). \(\beta_1\) and \(\beta_2\) can either a vector or scalar, or \(\beta_2\) can be null.

Example: \(H_0: \beta_1 = \beta_{1,0}\) (where \(\beta_{1,0}\) is specified) and \(\hat{\beta}_{2,0}\) be the MLE of \(\beta_2\) under the restriction that \(\beta_1 = \beta_{1,0}\). The likelihood ratio test statistic is

\[ -2\log\Lambda = -2[\log(L(\beta_{1,0},\hat{\beta}_{2,0})) - \log(L(\hat{\beta}_1,\hat{\beta}_2))] \]

where

  • the first term is the value fo the likelihood for the fitted restricted model
  • the second term is the likelihood value of the fitted unrestricted model

Under the null,

\[ -2 \log \Lambda \sim \chi^2_{\upsilon} \]

where \(\upsilon\) is the dimension of \(\beta_1\)

We reject the null when \(-2\log \Lambda > \chi_{\upsilon,1-\alpha}^2\)

Wald Statistics

Based on

\[ \hat{\beta} \sim AN (\beta, [\mathbf{I}(\beta)^{-1}]) \]

\[ H_0: \mathbf{L}\hat{\beta} = 0 \]

where \(\mathbf{L}\) is a q x p matrix with q linearly independent rows. Then

\[ W = (\mathbf{L\hat{\beta}})'(\mathbf{L[I(\hat{\beta})]^{-1}L'})^{-1}(\mathbf{L\hat{\beta}}) \]

under the null hypothesis

Confidence interval

\[ \hat{\beta}_i \pm 1.96 \hat{s}_{ii}^2 \]

where \(\hat{s}_{ii}^2\) is the i-th diagonal of \(\mathbf{[I(\hat{\beta})]}^{-1}\)

If you have

  • large sample size, the likelihood ratio and Wald tests have similar results.
  • small sample size, the likelihood ratio test is better.

Logistic Regression: Interpretation of \(\beta\)

For single regressor, the model is

\[ logit\{\hat{p}_{x_i}\} \equiv logit (\hat{p}_i) = \log(\frac{\hat{p}_i}{1 - \hat{p}_i}) = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

When \(x= x_i + 1\)

\[ logit\{\hat{p}_{x_i +1}\} = \hat{\beta}_0 + \hat{\beta}(x_i + 1) = logit\{\hat{p}_{x_i}\} + \hat{\beta}_1 \]

Then,

\[ logit\{\hat{p}_{x_i +1}\} - logit\{\hat{p}_{x_i}\} = log\{odds[\hat{p}_{x_i +1}]\} - log\{odds[\hat{p}_{x_i}]\} \\ = log(\frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]}) = \hat{\beta}_1 \]

and

\[ exp(\hat{\beta}_1) = \frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]} \]

the estimated odds ratio

the estimated odds ratio, when there is a difference of c units in the regressor x, is \(exp(c\hat{\beta}_1)\). When there are multiple covariates, \(exp(\hat{\beta}_k)\) is the estimated odds ratio for the variable \(x_k\), assuming that all of the other variables are held constant.

Residuals for Binomial Regression

With LLSR, residuals were used to assess model assumptions and identify outliers. For binomial regression, two different types of residuals are typically used. One residual, the Pearson residual, has a form similar to that used with LLSR. Specifically, the Pearson residual is calculated using:

\[\begin{equation*} \textrm{Pearson residual}_i = \frac{\textrm{actual count}-\textrm{predicted count}}{\textrm{SD of count}} = \frac{Y_i-m_i\hat{p_i}}{\sqrt{m_i\hat{p_i}(1-\hat{p_i})}} \end{equation*}\] where \(m_i\) is the number of trials for the \(i^{th}\) observation and \(\hat{p}_i\) is the estimated probability of success for that same observation.

A deviance residual is an alternative residual for binomial regression based on the discrepancy between the observed values and those estimated using the likelihood. A deviance residual can be calculated for each observation using:

\[\begin{equation*} \textrm{d}_i = \textrm{sign}(Y_i-m_i\hat{p_i})\sqrt{2[Y_i \log\left(\frac{Y_i}{m_i \hat{p_i}}\right)+ (m_i - Y_i) \log\left(\frac{m_i - Y_i}{m_i - m_i \hat{p_i}}\right)]} \end{equation*}\]

When the number of trials is large for all of the observations and the models are appropriate, both sets of residuals should follow a standard normal distribution.

The sum of the individual deviance residuals is referred to as the deviance or residual deviance. The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, \(m_i\), are large and a model fits, the residual deviance follows a \(\chi^2\) distribution with \(n-p\) degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meets these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a \(\chi_{n-p}^2\) distribution that are greater than the observed residual deviance.

Application in R and STATA

About the dataset
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
62 0 0 138 294 1 1 106 0 1.9 1 3 2 0
58 0 0 100 248 0 0 122 0 1.0 1 0 2 1

single variable discriptive statistics

quietly import delimited heart

label define sex 1 "male" 0 "female", replace
label value sex sex
label define fbs 1 "true" 0 "false", replace
label value fbs fbs
label define target 0 "No disease" 1 "Disease", replace
label value target target
label define thal 1 "normal" 2 "fixed defect" 3 "reversible defect", replace
label value thal thal
label define exang 1 "yes" 0 "no", replace
label value exang exang

quietly save heart , replace
quietly import delimited heart

dtable, continuous(age trestbps chol thalach oldpeak) factor(exang sex target thal fbs restecg cp ca slope)  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------
              Summary    
-------------------------
N                   1,025
age        54.434 (9.072)
trestbps 131.612 (17.517)
chol     246.000 (51.593)
thalach  149.114 (23.006)
oldpeak     1.072 (1.175)
exang                    
  0           680 (66.3%)
  1           345 (33.7%)
sex                      
  0           312 (30.4%)
  1           713 (69.6%)
target                   
  0           499 (48.7%)
  1           526 (51.3%)
thal                     
  0              7 (0.7%)
  1             64 (6.2%)
  2           544 (53.1%)
  3           410 (40.0%)
fbs                      
  0           872 (85.1%)
  1           153 (14.9%)
restecg                  
  0           497 (48.5%)
  1           513 (50.0%)
  2             15 (1.5%)
cp                       
  0           497 (48.5%)
  1           167 (16.3%)
  2           284 (27.7%)
  3             77 (7.5%)
ca                       
  0           578 (56.4%)
  1           226 (22.0%)
  2           134 (13.1%)
  3             69 (6.7%)
  4             18 (1.8%)
slope                    
  0             74 (7.2%)
  1           482 (47.0%)
  2           469 (45.8%)
-------------------------
(collection DTable exported to file Statin_table.docx)
Comments
  • The table above shows frequencies and proportions for categorical data.
  • Continuous data was summarised by means and standard deviations
use heart 

quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)

quietly histogram trestbps ,title("Histogram of trestbps") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)

quietly histogram chol ,title("Histogram of chol") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)

quietly histogram thalach ,title("Histogram of thalach") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph4 , replace)

quietly histogram oldpeak ,title("Histogram of oldpeak") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph5 , replace)

graph pie, over(sex)  name(graph6 , replace) title("distribution of sex")

graph pie, over(exang) name(graph7 , replace) title("distribution of exang")

graph hbar (count), over(cp) blabel(bar) name(graph8 , replace) title("distribution of cp")

graph hbar (count), over(ca) blabel(bar) name(graph9 , replace) title("distribution of ca")

graph hbar (count), over(fbs) blabel(bar) name(graph10 , replace) title("distribution of fbs")

graph hbar (count), over(thal) blabel(bar) name(graph11 , replace) title("distribution of thal")

graph hbar (count), over(restecg) blabel(bar) name(graph12 , replace) title("distribution of restecg")

graph hbar (count), over(slope) blabel(bar) name(graph13 , replace) title("distribution of slope")

graph pie, over(target) name(graph14 , replace) title("distribution of target")

graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 graph14  ,title("")  

quietly graph export all_graphs.svg, replace

Stata Graph - Graph 0 .02 .04 .06 .08 Density 30 40 50 60 70 80 age Source : Data source unknown Histogram of Age 0 .01 .02 .03 .04 Density 100 120 140 160 180 200 trestbps Source : Data source unknown Histogram of trestbps 0 .002 .004 .006 .008 Density 100 200 300 400 500 600 chol Source : Data source unknown Histogram of chol 0 .005 .01 .015 .02 Density 50 100 150 200 thalach Source : Data source unknown Histogram of thalach 0 .5 1 1.5 2 Density 0 2 4 6 oldpeak Source : Data source unknown Histogram of oldpeak female male distribution of sex no yes distribution of exang 77 284 167 497 0 100 200 300 400 500 frequency 3 2 1 0 distribution of cp 18 69 134 226 578 0 200 400 600 frequency 4 3 2 1 0 distribution of ca 153 872 0 200 400 600 800 frequency true false distribution of fbs 410 544 64 7 0 200 400 600 frequency reversible defect fixed defect normal 0 distribution of thal 15 513 497 0 100 200 300 400 500 frequency 2 1 0 distribution of restecg 469 482 74 0 100 200 300 400 500 frequency 2 1 0 distribution of slope No disease Disease distribution of target

Summary statistics by Outcome Variable

quietly import delimited heart

dtable,by(target, tests testnotes) continuous(age trestbps chol thalach oldpeak) factor(exang sex thal fbs restecg cp ca slope)  title(Table1) titlestyles( font(, bold) ) export("target", as(docx) replace)
note: using test regress across levels of target for age, trestbps, chol,
      thalach, and oldpeak.
note: using test pearson across levels of target for exang, sex, thal, fbs,
      restecg, cp, ca, and slope.

Table1
------------------------------------------------------------------
                                   target                         
                 0                1              Total       Test 
------------------------------------------------------------------
N             499 (48.7%)      526 (51.3%)   1,025 (100.0%)       
age        56.569 (7.908)   52.409 (9.632)   54.434 (9.072) <0.001
trestbps 134.106 (18.577) 129.245 (16.112) 131.612 (17.517) <0.001
chol     251.293 (49.559) 240.979 (53.010) 246.000 (51.593)  0.001
thalach  139.130 (22.565) 158.586 (19.097) 149.114 (23.006) <0.001
oldpeak     1.600 (1.291)    0.570 (0.771)    1.072 (1.175) <0.001
exang                                                             
  0           225 (45.1%)      455 (86.5%)      680 (66.3%) <0.001
  1           274 (54.9%)       71 (13.5%)      345 (33.7%)       
sex                                                               
  0            86 (17.2%)      226 (43.0%)      312 (30.4%) <0.001
  1           413 (82.8%)      300 (57.0%)      713 (69.6%)       
thal                                                              
  0              4 (0.8%)         3 (0.6%)         7 (0.7%) <0.001
  1             43 (8.6%)        21 (4.0%)        64 (6.2%)       
  2           132 (26.5%)      412 (78.3%)      544 (53.1%)       
  3           320 (64.1%)       90 (17.1%)      410 (40.0%)       
fbs                                                               
  0           417 (83.6%)      455 (86.5%)      872 (85.1%)  0.188
  1            82 (16.4%)       71 (13.5%)      153 (14.9%)       
restecg                                                           
  0           283 (56.7%)      214 (40.7%)      497 (48.5%) <0.001
  1           204 (40.9%)      309 (58.7%)      513 (50.0%)       
  2             12 (2.4%)         3 (0.6%)        15 (1.5%)       
cp                                                                
  0           375 (75.2%)      122 (23.2%)      497 (48.5%) <0.001
  1             33 (6.6%)      134 (25.5%)      167 (16.3%)       
  2            65 (13.0%)      219 (41.6%)      284 (27.7%)       
  3             26 (5.2%)        51 (9.7%)        77 (7.5%)       
ca                                                                
  0           163 (32.7%)      415 (78.9%)      578 (56.4%) <0.001
  1           160 (32.1%)       66 (12.5%)      226 (22.0%)       
  2           113 (22.6%)        21 (4.0%)      134 (13.1%)       
  3            60 (12.0%)         9 (1.7%)        69 (6.7%)       
  4              3 (0.6%)        15 (2.9%)        18 (1.8%)       
slope                                                             
  0             46 (9.2%)        28 (5.3%)        74 (7.2%) <0.001
  1           324 (64.9%)      158 (30.0%)      482 (47.0%)       
  2           129 (25.9%)      340 (64.6%)      469 (45.8%)       
------------------------------------------------------------------
(collection DTable exported to file target.docx)
Comments
  • Preliminary results from Chi-square tests of association and student T.tests indicate that at \(5\%\) level of significance, most variables are are associated with being diagnosed or not being diagnosed with the disease.
use heart 

graph box age , over(target) title("distribution of Age by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph1 , replace)

graph box trestbps , over(target) title("distribution of trestbps by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph2 , replace)

graph box chol ,over(target) title("distribution of chol by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph3 , replace)

graph box thalach ,over(target) title("distribution of thalach by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph4 , replace)

graph box oldpeak ,over(target) title("distribution of oldpeak by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph5 , replace)

graph hbar ,over(target) over(sex) stack asyvars name(graph6 , replace) title("distribution of sex")

graph hbar ,over(target) over(exang) stack asyvars name(graph7 , replace) title("distribution of exang")

graph hbar ,over(target) over(cp) stack asyvars name(graph8 , replace) title("distribution of cp")

graph hbar ,over(target) over(ca)  stack asyvars name(graph9 , replace) title("distribution of ca")

graph hbar,over(target) over(fbs) stack asyvars name(graph10 , replace) title("distribution of fbs")

graph hbar ,over(target) over(thal) stack asyvars name(graph11 , replace) title("distribution of thal")

graph hbar ,over(target) over(restecg) stack asyvars name(graph12 , replace) title("distribution of restecg")

graph hbar ,over(target) over(slope)  stack asyvars name(graph13 , replace) title("distribution of slope")



graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13  ,title("") 

quietly graph export all_target.svg, replace

Stata Graph - Graph 30 40 50 60 70 80 Density No disease Disease Source : Data source unknown distribution of Age by target 100 120 140 160 180 200 Density No disease Disease Source : Data source unknown distribution of trestbps by target 100 200 300 400 500 600 Density No disease Disease Source : Data source unknown distribution of chol by target 50 100 150 200 Density No disease Disease Source : Data source unknown distribution of thalach by target 0 2 4 6 Density No disease Disease Source : Data source unknown distribution of oldpeak by target 0 20 40 60 80 percent male female distribution of sex No disease Disease 0 20 40 60 80 percent yes no distribution of exang No disease Disease 0 10 20 30 40 50 percent 3 2 1 0 distribution of cp No disease Disease 0 20 40 60 percent 4 3 2 1 0 distribution of ca No disease Disease 0 20 40 60 80 percent true false distribution of fbs No disease Disease 0 10 20 30 40 50 percent reversible defect fixed defect normal 0 distribution of thal No disease Disease 0 10 20 30 40 50 percent 2 1 0 distribution of restecg No disease Disease 0 10 20 30 40 50 percent 2 1 0 distribution of slope No disease Disease

Fit a simple logistic regression model using age as the only risk factor of heart disease status

Recap

A logistic regression model is a generalized linear model and is made of a linear predictor:

\[\underbrace{g(\mu_i)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}\]

Note
  • the equation below shows a simple logistic regression model fit on age as the predictor.

\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})= 2.943003-.0530036Age \end{aligned}\]

(b) Report the estimated coefficient for age, and interpret its sign and meaning in the context of heart disease risk

Comments
  • the model for the output above can be written as follows

\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 2.943003-.0530036Age \end{aligned}\]

  • The estimated coefficient for age is -0.0530036 and it is negative
  • Here we see that for a 1 year increase in age , the log odds of being diagnosed with heart disease decrease significantly by 0.0530036
  • Thus the chances of developing heart disease decrease with age

Is Age significantly associated with heart disease?

  • Age is indeed significantly associated with probability of developing heart disease at 5% level of significance since the corresponding p-value is 0.000 and is less than 0.05
  • the coefficient’s confidence is given by \([-0.0674686 ,-0.0385386]\) and does not contain zero hence the variable is significant.
Determining the odds

\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})&= 2.943003-.0530036Age\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003-.0530036Age}\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003}.e^{-.0530036Age} \end{aligned}\]

  • the odds associated with age is therefore \(e^{-.0530036}=0.9483766\) meaning that for a 1 year increase in age the odds of being diagnosed with heart disease decrease by \((1-0.9483766)*100\% \approx 5.16\%\)

Code below shows how to get the estimated odds ratios

quietly import delimited heart

logistic target age
Logistic regression                                     Number of obs =  1,025
                                                        LR chi2(1)    =  55.17
                                                        Prob > chi2   = 0.0000
Log likelihood = -682.53284                             Pseudo R2     = 0.0388

------------------------------------------------------------------------------
      target | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9483766   .0069993    -7.18   0.000     .9347571    .9621946
       _cons |   18.97274   7.749939     7.20   0.000      8.51988    42.25001
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Tip
  • also note that we can find the probability of being diagnosed at a given age by the following probability formula.

\[ \begin{aligned} \hat{p}(x) &= \frac{exp^{\hat{\beta}_{0} + \hat{\beta}_{1}{age}}}{1 + exp^{\hat{\beta}_{0} + \hat{\beta}_{1}{age}}}\\ &= \frac{exp^{2.943003-.0530036Age}}{1 + exp^{2.943003-.0530036Age}} \end{aligned}\]

likelihood ratio test

Comments
  • The likelihood ratio test above compares the null model with a model with age alone.
  • the \(p-value<0.001\) indicates that the model with age is better fit compared to a model with no covariates at all.
  • thus our current model fits the data better than a model with no predictors.

Add sex variable

Comments
  • the model for the output above can be written as follows

\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 4.819817-.0674349Age-1.533812Male \end{aligned}\]

  • The estimated coefficient for age is -.0674349 and it is negative
  • Here we see that for a 1 year increase in age , the log odds of being diagnosed with heart disease decrease significantly by 0.0674349 while adjusting for gender of the patient.
  • whilst Adjusting for Age, males have 1.533812 less log odds of being diagnosed with heart disease as compared to females.
  • the two variables are significant at 5% level of significance since they have p-values less than 5%

Confidence Intervals for Model Coefficients

Since the Wald statistic follows a normal distribution with \(n\) large, we could generate a Wald-type (normal-based) confidence interval for \(\beta_2\) using:

\(\hat\beta_2 \pm 1.96\cdot\textrm{SE}(\hat\beta_2)\) and then exponentiating endpoints if we prefer a confidence interval for the odds ratio \(e^{\beta_2}\). In this case,

\[ \begin{align*} 95\% \textrm{ CI for } \beta_2 &= \hat{\beta}_2 \pm 1.96 \cdot \textrm{SE}(\hat{\beta}_2) \\ &= -1.533812 \pm 1.96 \cdot 0.158526 \\ &= -0.0132 \pm 0.00764 \\ &= (-1.223101, -1.844523) \\ 95\% \textrm{ CI for } e^{\beta_2} &= (e^{-1.223101}, e^{-1.844523}) \\ &= (0.2943161, 0.1581007) \\ \end{align*} \]

confint(mod2)
Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept)  3.88237336  5.79217853
age         -0.08342613 -0.05190659
sex         -1.84937968 -1.22746318
exp(confint(mod2))
Waiting for profiling to be done...
                 2.5 %      97.5 %
(Intercept) 48.5392799 327.7262070
age          0.9199590   0.9494175
sex          0.1573347   0.2930350

Thus, we can be 95% confident that being male is associated with a 84% to 70.7% decrease in the odds of dying after controlling age.

Multinomial logistic Regression

If we have more than two categories or groups that we want to model relative to covariates (e.g., we have observations \(i = 1,…,n\) and groups/ covariates \(j = 1,2,…,J\)), multinomial is our candidate model

Let

  • \(p_{ij}\) be the probability that the i-th observation belongs to the j-th group
  • \(Y_{ij}\) be the number of observations for individual i in group j; An individual will have observations \(Y_{i1},Y_{i2},…Y_{iJ}\)
  • assume the probability of observing this response is given by a multinomial distribution in terms of probabilities \(p_{ij}\), where \(\sum_{j = 1}^J p_{ij} = 1\) . For interpretation, we have a baseline category \(p_{i1} = 1 - \sum_{j = 2}^J p_{ij}\)

The link between the mean response (probability) \(p_{ij}\) and a linear function of the covariates

\[ \eta_{ij} = \mathbf{x'_i \beta_j} = \log \frac{p_{ij}}{p_{i1}}, j = 2,..,J \]

We compare \(p_{ij}\) to the baseline \(p_{i1}\), suggesting

\[ p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{i=2}^J \exp(\eta_{ij})} \]

which is known as multinomial logistic model.

Note:

  • Softmax coding for multinomial logistic regression: rather than selecting a baseline class, we treat all K class symmetrically - equally important (no baseline).

\[ P(Y = k | X = x) = \frac{exp(\beta_{k1} + \dots + \beta_{k_p x_p})}{\sum_{l = 1}^K exp(\beta_{l0} + \dots + \beta_{l_p x_p})} \]

then the log odds ratio between k-th and k’-th classes is

\[ \log (\frac{P(Y=k|X=x)}{P(Y = k' | X=x)}) = (\beta_{k0} - \beta_{k'0}) + \dots + (\beta_{kp} - \beta_{k'p}) x_p \]

About the dataset

You are an analyst at a large technology company. The company recently introduced a new health insurance provider for its employees. At the beginning of the year the employees had to choose one of three different health plan products from this provider to best suit their needs. You have been asked to determine which factors influenced the choice in product.

Thehealth_insurance` data set consists of the following fields:

  • product: The choice of product of the individual—A, B or C
  • age: The age of the individual when they made the choice
  • gender: The gender of the individual as stated when they made the choice
  • household: The number of people living with the individual in the same household at the time of the choice
  • position_level: Position level in the company at the time they made the choice, where 1 is is the lowest and 5 is the highest
  • absent: The number of days the individual was absent from work in the year prior to the choice

Descriptive statistics for single variables

quietly import delimited health_insurance 
egen product2 = group(product), label
egen position_level2 = group(position_level), label
egen gender2 = group(gender), label
drop product
drop position_level
drop gender
rename product2 product
rename gender2 gender
rename position_level2 position_level
quietly save health_insurance , replace
dtable, continuous(age absent household) factor(gender product position_level)  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------------------
                          Summary    
-------------------------------------
N                               1,453
age                   40.919 (13.530)
absent                 14.467 (8.114)
household               3.257 (2.231)
group(gender)                        
  Female                  889 (61.2%)
  Male                    559 (38.5%)
  Non-binary                 5 (0.3%)
group(product)                       
  A                       495 (34.1%)
  B                       459 (31.6%)
  C                       499 (34.3%)
group(position_level)                
  1                       184 (12.7%)
  2                       404 (27.8%)
  3                       436 (30.0%)
  4                       231 (15.9%)
  5                       198 (13.6%)
-------------------------------------
(collection DTable exported to file Statin_table.docx)

Descriptive plots for single variables

use health_insurance

quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)

quietly histogram absent ,title("Histogram of absentism") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)

quietly histogram household ,title("Histogram of number of households") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)

graph pie, over(position_level)  name(graph4 , replace) title("distribution of Position level")

graph hbar (count), over(gender) blabel(bar) name(graph5 , replace) title("distribution of Gender")


graph hbar (count), over(product) blabel(bar) name(graph6 , replace) title("distribution of Products")


graph combine graph1 graph2 graph3 graph4 graph5 graph6  ,title("")  

quietly graph export all_multi.svg, replace

Stata Graph - Graph 0 .02 .04 .06 Density 20 30 40 50 60 70 age Source : Data source unknown Histogram of Age 0 .01 .02 .03 .04 .05 Density 0 10 20 30 absent Source : Data source unknown Histogram of absentism 0 .2 .4 .6 .8 Density 0 2 4 6 8 household Source : Data source unknown Histogram of number of households 1 2 3 4 5 distribution of Position level 5 559 889 0 200 400 600 800 1,000 frequency Non-binary Male Female distribution of Gender 499 459 495 0 100 200 300 400 500 frequency C B A distribution of Products

Descriptive statistics by outcome variable

use health_insurance
dtable,by(product, tests testnotes) continuous(age absent household) factor(gender  position_level)  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
note: using test regress across levels of product for age, absent, and
      household.
note: using test pearson across levels of product for gender and
      position_level.

Table1
-------------------------------------------------------------------------------------------
                                                  group(product)                           
                             A              B               C             Total       Test 
-------------------------------------------------------------------------------------------
N                        495 (34.1%)     459 (31.6%)     499 (34.3%)  1,453 (100.0%)       
age                   28.913 (5.158) 44.865 (13.756) 49.198 (10.346) 40.919 (13.530) <0.001
absent                14.451 (8.030)  14.394 (8.092)  14.549 (8.232)  14.467 (8.114)  0.956
household              3.642 (2.339)   1.462 (1.201)   4.527 (1.740)   3.257 (2.231) <0.001
group(gender)                                                                              
  Female                 244 (49.3%)     404 (88.0%)     241 (48.3%)     889 (61.2%) <0.001
  Male                   249 (50.3%)      53 (11.5%)     257 (51.5%)     559 (38.5%)       
  Non-binary                2 (0.4%)        2 (0.4%)        1 (0.2%)        5 (0.3%)       
group(position_level)                                                                      
  1                       66 (13.3%)      59 (12.9%)      59 (11.8%)     184 (12.7%)  0.848
  2                      136 (27.5%)     133 (29.0%)     135 (27.1%)     404 (27.8%)       
  3                      144 (29.1%)     146 (31.8%)     146 (29.3%)     436 (30.0%)       
  4                       81 (16.4%)      65 (14.2%)      85 (17.0%)     231 (15.9%)       
  5                       68 (13.7%)      56 (12.2%)      74 (14.8%)     198 (13.6%)       
-------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)

Plots by Outcome variable

use health_insurance

graph box age ,over(product) title("boxplot of Age by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)

graph box absent ,over(product) title("boxplot of absentism by product") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph2 , replace)

graph box household ,over(product) title("boxplot of households by product ") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)

graph hbar,over(product) over(position_level)  stack asyvars name(graph4 , replace) title("distribution of Position level")

graph hbar ,over(product) over(gender)  stack asyvars name(graph5 , replace) title("distribution of Gender")

graph combine graph1 graph2 graph3 graph4 graph5  ,title("")  
quietly graph export all_multii.svg, replace

Stata Graph - Graph 20 30 40 50 60 70 Density A B C Source : Data source unknown boxplot of Age by product 0 10 20 30 Density A B C Source : Data source unknown boxplot of absentism by product 0 2 4 6 8 Density A B C Source : Data source unknown boxplot of households by product  0 10 20 30 percent 5 4 3 2 1 distribution of Position level A B C 0 20 40 60 percent Non-binary Male Female distribution of Gender A B C

Assess the nature of the product variable.

Note
use health_insurance
tab product
group(produ |
        ct) |      Freq.     Percent        Cum.
------------+-----------------------------------
          A |        495       34.07       34.07
          B |        459       31.59       65.66
          C |        499       34.34      100.00
------------+-----------------------------------
      Total |      1,453      100.00
  • The product variable has 3 levels which represents 3 product types (\(A,B ~and~ C\))
  • Thus the variable is of nominal type because it represents distinct product types without an inherent order (i.e., A is not objectively less than B and B is not less than C).”

Model building

The most suitable model for this analysis is Multinomial logistic regression

Interpretations

Log Odds and Odds Ratios

  • The logit functions (log odds) when our outcome is a Product variable with (A,B and C) categories and the independent variable is age is shown below
  1. the log odds for comparison between B and A is

\[g_1(x) = ln\frac{P(product=B|age)} {P(product=A|age)}=\alpha_1 + (\beta_{11}age)\]

  1. and, the log odds for comparison between C and A is

\[g_2(x) = ln\frac{P(Product=C|age)} {P(Product=A|age)}=\alpha_2 + (\beta_{21}age)\]

  • Here the base category is product A hence all inferences are done with reference to it

thus for the simple model from the stata output we have two logit models:

\[ \begin{aligned} g_1(x) &= ln\frac{P(Product=B|age)} {P(Product=A|age)}=-6.974368 + (0.1969927age)\\ RRR_1&=\frac{P(Product=B|age)} {P(Product=A|age)}=e^{-6.974368 + (0.1969927age)} \end{aligned}\]

Note

Interpretation of coefficients

  • for one unit change in the variable age, the multinomial log-odds for choosing Product B relative to Product A will be increased by 0.1969927, age is also statistically significant at 5% level of significance.

Interpretation of exponenciated coefficients

  • for one unit change in the variable age, the relative risk ratio for choosing Product B relative to product A will be increased by \(e^{0.1969927}=1.21773\), age is also statistically significant at 5% level of significance.

\[ \begin{aligned} g_2(x) &= ln\frac{P(Product=C|age)} {P(Product=A|age)}=-8.323104 + (0.2274424age)\\ RRR_2&= \frac{P(Product=C|age)} {P(Product=A|age)}=e^{-8.323104 + (0.2274424age)} \end{aligned} \]

  • for one unit change in the variable age, the multinomial log-odds for choosing Product C relative to Product A will be increased by 0.2274424, age is also statistically significant at 5% level of significance.

Interpretation of exponenciated coefficients

  • for one unit change in the variable age, the relative risk ratio for choosing Product C relative to product A will be increased by a factor of \(e^{0.2274424}=1.255385\), age is also statistically significant at 5% level of significance.
Interpretation

Product B relative to Product A

\[ \begin{aligned} ln\frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=-5.207851 + (0.2449861age)+(0.0137522Absent)-.9662861household-2.36947Male-0.240953Nonbinary -0.2033913Position_2-0.6830315Position_3-1.510145Position_4-1.407559Position_5\\ \frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&= e^{-5.207851 + (0.2449861age)+(0.0137522Absent)-.9662861household-2.36947Male-0.240953Nonbinary -0.2033913Position_2-0.6830315Position_3-1.510145Position_4-1.407559Position_5\\} \end{aligned}\]

  • for one unit change in the variable age, the relative risk ratio for choosing Product B relative to product A will be increased significantly by a factor of \(1.277604\) while adjusting for other covariates .Age is statistically significant \((p<0.05)\)
  • for a 1 day increase in absentism from previous year, the relative risk ratio for choosing product B relative to product A increase by a factor of \(1.013847\) while adjusting for other variables, however absentism is not statistically significant \((p>0.05)\)
  • If one person is added to the number of individuals in the employee’s household , the Relative Risk ratio of choosing Product B relative to product A decrease by a factor of \(.3804935\) while adjusting for other covariates in the model. Number of individuals in the household is also statistically significant at 5% level of significance.
  • The relative risk ratio for choosing Product B relative to product A decrease by (1-.0935297):91% for males as compared to females while adjusting for all other covariates.Gender is a statistically significant predictor for choice of insurance product.\((p<0.05)\)
  • The relative Risk ratio for choosing product B relative to product A increase as position level increase from position level 2 , 3, 4 and 5 relative to position 1 while holding all covariates constant. In essence this means that :
  1. decrease in RRR for Product 2 vs product 1 ->\(0.1840411\)
  2. decrease in RRR for Product 3 vs product 1 ->\(0.4949165^{***}\)
  3. decrease in RRR for Product 4 vs product 1 ->\(0.779122^{***}\)
  4. decrease in RRR for Product 5 vs product 1 ->\(0.7552601^{***}\)

here we not that the starred comparisons are the only significant ones at 5% level of significance.

Product C relative to Product A

\[ \begin{aligned} ln\frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=-10.54573+(0.2706643age)+(0.0046137 Absent)+0.2077009household+0.1085684Male-1.293302Nonbinary -0.1427916Position_2-0.3147303Position_3-0.8247833Position_4-0.730075Position_5\\ \quad this~equates~to~RRR~as~follows~\\ \frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=e^{-10.54573+(0.2706643age)+(0.0046137 Absent)+0.2077009household+0.1085684Male-1.293302Nonbinary -0.1427916Position_2-0.3147303Position_3-0.8247833Position_4-0.730075Position_5} \end{aligned}\]

  • for one unit change in the variable age, the relative risk ratio for choosing Product C relative to product A will be increased significantly by a factor of 1.310835 while adjusting for other covariates .Age is statistically significant
  • for a 1 day increase in absentism from previous year, the relative risk ratio for choosing product C relative to product A increase by a factor of 1.004624 while adjusting for other variables, however absentism is not statistically significant
  • If one person is added to the number of individuals in the employee’s household , the Relative Risk ratio of choosing Product C relative to product A Increase by a factor of 1.230845 while adjusting for other covariates in the model. Number of individuals in the household is also statistically significant at 5% level of significance.
  • The relative risk ratio for choosing Product C relative to product A Increase by (1.11468-1):11.5% for males as compared to females while adjusting for all other covariates. Gender is However not a statistically significant predictor for choice of insurance product here.
  • The relative Risk ratio for choosing product C relative to product A decrease as position level decrease from position level 2 , 3, 4 and 5 relative to position 1 while holding all covariates constant. In essence this means that :
  1. decrease in RRR for Position 2 vs Position 1 ->\(0.1330653\)
  2. decrease in RRR for Position 3 vs Position 1 ->\(0.2700143\)
  3. decrease in RRR for Position 4 vs Position 1 ->\(0.56167^{***}\)
  4. decrease in RRR for Position 5 vs Position 1 ->\(0.5181271\)

here we not that the starred comparisons are the only significant ones at 5% level of significance.

Evaluate and compare the overall fit

Null model vs Age model

use health_insurance
quietly mlogit product, baseoutcome(1) rrr

est store model1

quietly mlogit product age  , baseoutcome(1) rrr
est store model2

**Likelihood ratio test**
  
estout model1 model2 , stats(bic aic)  
  
lrtest model1 model2 
                   model1       model2
                        b            b
--------------------------------------
A                                     
age                                  0
_cons                   0            0
--------------------------------------
B                                     
age                           .1969927
_cons           -.0755076    -6.974368
--------------------------------------
C                                     
age                           .2274424
_cons            .0080483    -8.323104
--------------------------------------
bic              3205.108     2371.201
aic              3194.546     2350.076
--------------------------------------


Likelihood-ratio test
Assumption: model1 nested within model2

 LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Note
  • For the model with age , AIC and BIC reduced significantly indicating that the model with age performs better than a null model.
  • Through the likelihood ratio test we see that the model with age performs better than the null model (model with no covariates) at 5% level of significance (\(p<0.001\))

Full model vs null model

use health_insurance
quietly mlogit product , baseoutcome(1) rrr

est store model1

quietly mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
est store model2

**Likelihood ratio test**
  
estout model1 model2 , stats(bic aic)  
  
lrtest model1 model2 
                   model1       model2
                        b            b
--------------------------------------
A                                     
age                                  0
absent                               0
household                            0
1.gender                             0
2.gender                             0
3.gender                             0
1.position~l                         0
2.position~l                         0
3.position~l                         0
4.position~l                         0
5.position~l                         0
_cons                   0            0
--------------------------------------
B                                     
age                           .2449861
absent                        .0137522
household                    -.9662861
1.gender                             0
2.gender                     -2.369476
3.gender                       .240953
1.position~l                         0
2.position~l                 -.2033913
3.position~l                 -.6830315
4.position~l                 -1.510145
5.position~l                 -1.407559
_cons           -.0755076    -5.207851
--------------------------------------
C                                     
age                           .2706643
absent                        .0046137
household                     .2077009
1.gender                             0
2.gender                      .1085684
3.gender                     -1.293302
1.position~l                         0
2.position~l                 -.1427916
3.position~l                 -.3147303
4.position~l                 -.8247833
5.position~l                  -.730075
_cons            .0080483    -10.54573
--------------------------------------
bic              3205.108     1632.389
aic              3194.546     1526.761
--------------------------------------


Likelihood-ratio test
Assumption: model1 nested within model2

LR chi2(18) = 1703.78
Prob > chi2 =  0.0000
Note
  • For the model with all variables , AIC and BIC reduced significantly indicating that the full model performs better than a null model.
  • Through the likelihood ratio test we see that the model with age performs better than the null model (model with no covariates) at 5% level of significance (\(p<0.001\))

Justify your selection

use health_insurance
quietly mlogit product age, baseoutcome(1) rrr

est store model1

quietly mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
est store model2

**Likelihood ratio test**
  
lrtest model1 model2 
Likelihood-ratio test
Assumption: model1 nested within model2

LR chi2(16) = 855.31
Prob > chi2 = 0.0000
Note
  • The results of the likelihood ratio tests can be used to ascertain the significance of predictors to the model. This table tells us that addition of more variables to age had significant main effects on Product choice,

\[\chi^2_{(16)} = 855.31, p = .000\]

use health_insurance

*Measures for model with age alone*
quietly mlogit product age, baseoutcome(1) rrr
fitstat , save

*Model with additional variables
quietly mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
fitstat, dif
Measures of Fit for mlogit of product

Log-Lik Intercept Only:    -1595.273     Log-Lik Full Model:        -1171.038
D(1447):                    2342.076     LR(2):                       848.470
                                         Prob > LR:                     0.000
McFadden's R2:                 0.266     McFadden's Adj R2:             0.262
Maximum Likelihood R2:         0.442     Cragg & Uhler's R2:            0.498
Count R2:                      0.509     Adj Count R2:                  0.255
AIC:                           1.620     AIC*n:                      2354.076
BIC:                       -8194.089     BIC':                       -833.907

(Indices saved in matrix fs_0)



Measures of Fit for mlogit of product

Log-Lik Intercept Only:    -1595.273     Log-Lik Full Model:         -743.381
D(1417):                    1486.761     LR(18):                     1703.784
                                         Prob > LR:                     0.000
McFadden's R2:                 0.534     McFadden's Adj R2:             0.511
Maximum Likelihood R2:         0.690     Cragg & Uhler's R2:            0.777
Count R2:                      0.566     Adj Count R2:                  0.341
AIC:                           1.073     AIC*n:                      1558.761
BIC:                       -8830.962     BIC':                      -1572.719
Comments
  • The table below and the one above show results for a full model and the one with a single predictor age respectively.
  • We see that on fitting the full model the following significant changes which show model improvement were noted:
  1. log likelihood increased from \(-1171.038~ to ~-743.381\)
  2. Akaike information criterion decreased from \(2354.076 ~to~ 1558.761\)
  3. Bayesian Information criterion decreased from \(-833.907 ~to~ -1572.719\)

all these results suggest that adding more variables to the model improved model fit. Hence the model with all variables performs better and the best than the model with AGE alone.