age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
58 | 0 | 0 | 100 | 248 | 0 | 0 | 122 | 0 | 1.0 | 1 | 0 | 2 | 1 |
Generalized Linear Modeling
GLM theory suggests that the canonical link can be modeled as a linear combination of the explanatory variable(s). This approach unifies a number of modeling results used in statistics.
By so doing we have now generalized our modeling to handle non-normal responses. In addition to normally distributed responses, we are able to handle Poisson responses, binomial responses, and more. Writing a pmf or pdf for a response in one-parameter exponential family form reveals the canonical link which can be modeled as a linear function of the predictors.
Exponential Family
The theory of GLMs is developed for data whose distribution belongs to the exponential family.
The form of the data distribution that is useful for GLMs is
\[ f(y;\theta, \phi) = \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y, \phi)) \]
where
- \(\theta\) is called the natural parameter
- \(\phi\) is called the dispersion parameter
Example
If we have \(Y \sim N(\mu, \sigma^2)\), then
\[ \begin{aligned} f(y; \mu, \sigma^2) &= \frac{1}{(2\pi \sigma^2)^{1/2}}\exp(-\frac{1}{2\sigma^2}(y- \mu)^2) \\ &= \exp(-\frac{1}{2\sigma^2}(y^2 - 2y \mu +\mu^2)- \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{y \mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y , \phi)) \end{aligned} \]
where
- \(\theta = \mu\)
- \(b(\theta) = \frac{\mu^2}{2}\)
- \(a(\phi) = \sigma^2 = \phi\)
- \(c(y , \phi) = - \frac{1}{2}(\frac{y^2}{\phi}+\log(2\pi \sigma^2))\)
Properties of GLM exponential families
\(E(Y) = b' (\theta)\) where \(b'(\theta) = \frac{\partial b(\theta)}{\partial \theta}\)
\(var(Y) = a(\phi)b''(\theta)= a(\phi)V(\mu)\).
- \(V(\mu)\) is the variance function; however, it is only the variance in the case that \(a(\phi) =1\)
Once \(a(\cdot), b(\cdot), c(\cdot)\) are identified, we can derive the expected value and variance of \(Y\).
Examples
Normal distribution
\[ \begin{aligned} b'(\theta) &= \frac{\partial (\theta^2/2)}{\partial \theta} = \theta = \mu \\ V(\mu) = b''(\theta) &= \frac{\partial^2 (\theta^2/2)}{\partial \theta^2} = 1 \\ \to var(Y) &= a(\phi)V(\mu) = \sigma^2 \end{aligned} \]
Poisson distribution
\[ \begin{aligned} f(y; \theta, \phi) &= \frac{\mu^y \exp(-\mu)}{y!} \\ &= \exp(y\log(\mu) - \mu - \log(y!)) \\ &= \exp(y\theta - \exp(\theta) - \log(y!)) \end{aligned} \]
where
- \(\theta = \log(\mu)\)
- \(a(\phi) = 1\)
- \(b(\theta) = \exp(\theta)\)
- \(c(y, \phi) = -\log(y!)\)
Hence,
\[ \begin{aligned} E(Y) &= \frac{\partial b(\theta)}{\partial \theta} = \exp(\theta) = \mu \\ var(Y) &= \frac{\partial^2 b(\theta)}{\partial \theta^2} = \mu \end{aligned} \]
Since \(\mu = E(Y) = b'(\theta)\), in a GLM we take some monotone function (typically nonlinear) of \(\mu\) to be linear in the set of covariates
\[ g(\mu) = g(b'(\theta)) = \mathbf{x'\beta} \]
Equivalently,
\[ \mu = g^{-1}(\mathbf{x'\beta}) \]
where \(g(\cdot)\) is called the link function, since it links the mean response (\(\mu = E(Y)\)) to a linear expression of the covariates
Some people write \(\eta = \mathbf{x'\beta}\), where \(\eta\) is called the "linear predictor"
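To make the link concrete, here is a minimal R sketch (simulated data; all names are illustrative) fitting a Poisson GLM with its canonical log link, so that \(g(\mu) = \log(\mu)\) is linear in the predictor:

```r
# Minimal sketch (simulated data): the canonical log link for a Poisson response
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x))  # true mean: mu = exp(eta)
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)  # estimates of the true (0.5, 1.2) on the log scale
```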
A GLM is composed of 2 components
The random component:
is the distribution chosen to model the response variables \(Y_1,...,Y_n\)
is specified by the choice of \(a(\cdot), b(\cdot), c(\cdot)\) in the exponential form
Notation:
- Assume that there are n independent response variables \(Y_1,...,Y_n\) with densities
\[ f(y_i ; \theta_i, \phi) = \exp(\frac{\theta_i y_i - b(\theta_i)}{a(\phi)}+ c(y_i, \phi)) \]
notice that each observation may have a different density
- Assume that \(\phi\) is constant for all \(i = 1,...,n\), but \(\theta_i\) will vary. \(\mu_i = E(Y_i)\) for all \(i\).
The systematic component
is the portion of the model that gives the relation between \(\mu\) and the covariates \(\mathbf{x}\)
consists of 2 parts:
- the link function, \(g(.)\)
- the linear predictor, \(\eta = \mathbf{x'\beta}\)
Notation:
- assume \(g(\mu_i) = \mathbf{x_i'\beta} = \eta_i\) where \(\mathbf{\beta} = (\beta_1,..., \beta_p)'\)
- The parameters to be estimated are \(\beta_1,...\beta_p , \phi\)
The Canonical Link
To choose \(g(.)\), we can use canonical link function (Remember: Canonical link is just a special case of the link function)
If the link function \(g(\cdot)\) is such that \(g(\mu_i) = \eta_i = \theta_i\), the natural parameter, then \(g(\cdot)\) is the canonical link.
The inverse link
\(g^{-1}(\cdot)\), also known as the mean function, takes the linear predictor output (ranging from \(-\infty\) to \(\infty\)) and transforms it onto a different scale.
Exponential: converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and \(\infty\) (which you can see is useful when you want to convert a linear predictor into a non-negative value): \(\lambda = \exp(y) = \exp(\mathbf{\beta X})\)
Inverse Logit (also known as logistic): converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and 1, which is useful in case you want to convert a linear predictor to a probability. \(\theta = \frac{1}{1 + \exp(-y)} = \frac{1}{1 + \exp(- \mathbf{\beta X})}\)
- \(y\) = linear predictor value
- \(\theta\) = transformed value
The identity link is
\[ \begin{aligned} \eta_i &= g(\mu_i) = \mu_i \\ \mu_i &= g^{-1}(\eta_i) = \eta_i\\ \end{aligned} \]
Examples
Normal random component
Mean Response: \(\mu_i = \theta_i\)
Canonical Link: \(g( \mu_i) = \mu_i\) (the identity link)
Binomial random component
Mean Response: \(\mu_i = \frac{n_i \exp( \theta_i)}{1+\exp (\theta_i)}\) and \(\theta(\mu_i) = \log(\frac{p_i }{1-p_i}) = \log (\frac{\mu_i} {n_i - \mu_i})\)
Canonical link: \(g(\mu_i) = \log(\frac{\mu_i} {n_i - \mu_i})\) (logit link)
Poisson random component
Mean Response: \(\mu_i = \exp(\theta_i)\)
Canonical Link: \(g(\mu_i) = \log(\mu_i)\)
Gamma random component:
Mean response: \(\mu_i = -\frac{1}{\theta_i}\) and \(\theta(\mu_i) = - \mu_i^{-1}\)
Canonical Link: \(g(\mu_i) = - \frac{1}{\mu_i}\)
Inverse Gaussian random component
- Canonical Link: \(g(\mu_i) = \frac{1}{\mu_i^2}\)
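As a quick check on the list above, base R's family objects report these canonical links directly (a small sketch using only base R):

```r
# Canonical links as reported by base R family objects
gaussian()$link          # "identity"
binomial()$link          # "logit"
poisson()$link           # "log"
Gamma()$link             # "inverse"
inverse.gaussian()$link  # "1/mu^2"
```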
Logistic Regression
If \(X\) follows a binomial distribution with \(n\) trials and probability of success \(p\), we can write:
\[\begin{align*} P(X=x)&= \binom{n}{x}p^x(1-p)^{(n-x)} \\ &=e^{x\log(p) + (n-x)\log(1-p) + \log\binom{n}{x}} \end{align*}\] However, this probability mass function is not quite in one-parameter exponential family form. Note that there are two terms in the exponent which consist of a product of functions of \(x\) and \(p\). So more simplification is in order:
\[\begin{equation*} P(X=x) = e^{x\log\left(\frac{p}{1-p}\right) + n\log(1-p)+ \log\binom{n}{x}} \end{equation*}\]
The one-parameter exponential family form for binomial responses shows that the canonical link is \(\log\left(\frac{p}{1-p}\right)\). Thus, GLM theory suggests that constructing a model using the logit, the log odds of a score, as a linear function of covariates is a reasonable approach.
\[ p_i = f(\mathbf{x}_i ; \beta) = \frac{exp(\mathbf{x_i'\beta})}{1 + exp(\mathbf{x_i'\beta})} \]
Equivalently,
\[ logit(p_i) = log(\frac{p_i}{1-p_i}) = \mathbf{x_i'\beta} \]
where \(\frac{p_i}{1-p_i}\) is the odds.
In this form, the model specifies that a function of the mean response is linear in the covariates; hence the name, Generalized Linear Models.
The likelihood function
\[L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i}\]
where \[p_i = \frac{\mathbf{exp(x'_i \beta})}{1+\mathbf{exp(x'_i \beta)}}\] and \(1-p_i = \frac{1}{1+ exp(\mathbf{x'_i \beta})}\)
Hence, our objective function is
\[ \begin{aligned} Q(\beta) &= \log(L(\beta))\\ &= \sum_{i=1}^n \left[Y_i \log(p_i) + (1-Y_i)\log(1-p_i)\right]\\ &= \sum_{i=1}^n \left[Y_i \log\left(\frac{\exp(\mathbf{x'_i \beta})}{1+\exp(\mathbf{x'_i \beta})}\right) + (1-Y_i)\log\left(\frac{1}{1+ \exp(\mathbf{x'_i \beta})}\right)\right]\\ &=\sum_{i=1}^n Y_i\mathbf{x'_i\beta} - \sum_{i=1}^n Y_i\log(1+\exp(\mathbf{x'_i\beta}))- \sum_{i=1}^n (1-Y_i)\log(1+\exp(\mathbf{x'_i\beta}))\\ &= \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n \log(1+ \exp(\mathbf{x'_i \beta})) \end{aligned} \]
We can maximize this function numerically, which yields the numerical MLE \(\hat{\beta}\). We can then use the standard asymptotic properties of MLEs to make inference.
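For instance, here is a minimal R sketch (simulated data; all names are illustrative) that maximizes \(Q(\beta)\) with `optim()` and checks the result against `glm()`:

```r
# Sketch: numerical MLE for logistic regression via the log-likelihood above
set.seed(42)
n <- 500
X <- cbind(1, rnorm(n))                 # design matrix with an intercept column
beta_true <- c(-0.5, 1.0)
y <- rbinom(n, 1, plogis(X %*% beta_true))

negQ <- function(beta) {
  eta <- X %*% beta
  -(sum(y * eta) - sum(log(1 + exp(eta))))   # minus Q(beta), to minimize
}
opt <- optim(c(0, 0), negQ, method = "BFGS")
opt$par                                  # numerical MLE of beta
coef(glm(y ~ X[, 2], family = binomial)) # should agree closely
```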
A key property of MLEs is that they are asymptotically unbiased, with asymptotic variance-covariance matrix given by the inverse Fisher information matrix
\[ \hat{\beta} \dot{\sim} AN(\beta,[\mathbf{I}(\beta)]^{-1}) \]
where the Fisher Information matrix, \(\mathbf{I}(\beta)\) is
\[ \begin{aligned} \mathbf{I}(\beta) &= E[\frac{\partial \log(L(\beta))}{\partial \beta}\frac{\partial \log(L(\beta))}{\partial \beta'}] \\ &= E[(\frac{\partial \log(L(\beta))}{\partial \beta_i} \frac{\partial \log(L(\beta))}{\partial \beta_j})_{ij}] \end{aligned} \]
Under regularity conditions, this is equivalent to the negative of the expected value of the Hessian Matrix
\[ \begin{aligned} \mathbf{I}(\beta) &= -E[\frac{\partial^2 \log(L(\beta))}{\partial \beta \partial \beta'}] \\ &= -E[(\frac{\partial^2 \log(L(\beta))}{\partial \beta_i \partial \beta_j})_{ij}] \end{aligned} \]
Example:
\[ x_i' \beta = \beta_0 + \beta_1 x_i \]
\[ \begin{aligned} - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_0} &= \sum_{i=1}^n \frac{\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_1} &= \sum_{i=1}^n \frac{x_i^2\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{x_i\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_i^2p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta_0 \partial \beta_1} &= \sum_{i=1}^n \frac{x_i\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - x_i[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_ip_i (1-p_i) \\ \end{aligned} \]
Hence,
\[ \mathbf{I} (\beta) = \left[ \begin{array} {cc} \sum_i p_i(1-p_i) & \sum_i x_i p_i(1-p_i) \\ \sum_i x_i p_i(1-p_i) & \sum_i x_i^2 p_i(1-p_i) \end{array} \right] \]
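Continuing the simulated sketch above (names `X`, `y`, `opt` carried over from that block), this matrix can be assembled directly and inverted to get the asymptotic variance-covariance matrix:

```r
# Sketch: information matrix for the simple logistic model,
# using X and opt from the previous block
p_hat <- as.vector(plogis(X %*% opt$par))
W <- p_hat * (1 - p_hat)         # the p_i(1 - p_i) weights
I_beta <- t(X) %*% (W * X)       # matches the 2x2 matrix displayed above
solve(I_beta)                    # asymptotic var-cov matrix of beta-hat
```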
Inference
Likelihood Ratio Tests
To formulate the test, let \(\beta = [\beta_1', \beta_2']'\). If you are interested in testing a hypothesis about \(\beta_1\), then we leave \(\beta_2\) unspecified (these are called nuisance parameters). \(\beta_1\) and \(\beta_2\) can each be a vector or a scalar, and \(\beta_2\) can be null.
Example: \(H_0: \beta_1 = \beta_{1,0}\) (where \(\beta_{1,0}\) is specified) and \(\hat{\beta}_{2,0}\) be the MLE of \(\beta_2\) under the restriction that \(\beta_1 = \beta_{1,0}\). The likelihood ratio test statistic is
\[ -2\log\Lambda = -2[\log(L(\beta_{1,0},\hat{\beta}_{2,0})) - \log(L(\hat{\beta}_1,\hat{\beta}_2))] \]
where
- the first term is the log-likelihood of the fitted restricted model
- the second term is the log-likelihood of the fitted unrestricted model
Under the null,
\[ -2 \log \Lambda \sim \chi^2_{\upsilon} \]
where \(\upsilon\) is the dimension of \(\beta_1\)
We reject the null when \(-2\log \Lambda > \chi_{\upsilon,1-\alpha}^2\)
Wald Statistics
Based on
\[ \hat{\beta} \sim AN (\beta, [\mathbf{I}(\beta)]^{-1}) \]
\[ H_0: \mathbf{L}\beta = 0 \]
where \(\mathbf{L}\) is a q x p matrix with q linearly independent rows. Then
\[ W = (\mathbf{L\hat{\beta}})'(\mathbf{L[I(\hat{\beta})]^{-1}L'})^{-1}(\mathbf{L\hat{\beta}}) \]
under the null hypothesis, \(W \sim \chi^2_q\)
Confidence interval
\[ \hat{\beta}_i \pm 1.96\, \hat{s}_{ii} \]
where \(\hat{s}_{ii}^2\) is the i-th diagonal element of \(\mathbf{[I(\hat{\beta})]}^{-1}\)
If you have
- large sample size, the likelihood ratio and Wald tests have similar results.
- small sample size, the likelihood ratio test is better.
Logistic Regression: Interpretation of \(\beta\)
For single regressor, the model is
\[ logit\{\hat{p}_{x_i}\} \equiv logit (\hat{p}_i) = \log(\frac{\hat{p}_i}{1 - \hat{p}_i}) = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
When \(x = x_i + 1\),
\[ logit\{\hat{p}_{x_i +1}\} = \hat{\beta}_0 + \hat{\beta}_1(x_i + 1) = logit\{\hat{p}_{x_i}\} + \hat{\beta}_1 \]
Then,
\[ logit\{\hat{p}_{x_i +1}\} - logit\{\hat{p}_{x_i}\} = log\{odds[\hat{p}_{x_i +1}]\} - log\{odds[\hat{p}_{x_i}]\} \\ = log(\frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]}) = \hat{\beta}_1 \]
and
\[ exp(\hat{\beta}_1) = \frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]} \]
the estimated odds ratio.
More generally, the estimated odds ratio for a difference of \(c\) units in the regressor \(x\) is \(exp(c\hat{\beta}_1)\). When there are multiple covariates, \(exp(\hat{\beta}_k)\) is the estimated odds ratio for the variable \(x_k\), holding all other variables constant.
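In R, the estimated odds ratios come from exponentiating the fitted coefficients. A small sketch, assuming a fitted binomial model object `fit` with a single regressor `x` (both names illustrative):

```r
# Sketch: odds ratios from a fitted logistic regression `fit`
exp(coef(fit))               # odds ratio for a 1-unit change in each regressor
delta <- 5
exp(delta * coef(fit)["x"])  # odds ratio for a 5-unit difference in x
```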
Residuals for Binomial Regression
With LLSR, residuals were used to assess model assumptions and identify outliers. For binomial regression, two different types of residuals are typically used. One residual, the Pearson residual, has a form similar to that used with LLSR. Specifically, the Pearson residual is calculated using:
\[\begin{equation*} \textrm{Pearson residual}_i = \frac{\textrm{actual count}-\textrm{predicted count}}{\textrm{SD of count}} = \frac{Y_i-m_i\hat{p_i}}{\sqrt{m_i\hat{p_i}(1-\hat{p_i})}} \end{equation*}\] where \(m_i\) is the number of trials for the \(i^{th}\) observation and \(\hat{p}_i\) is the estimated probability of success for that same observation.
A deviance residual is an alternative residual for binomial regression based on the discrepancy between the observed values and those estimated using the likelihood. A deviance residual can be calculated for each observation using:
\[\begin{equation*} \textrm{d}_i = \textrm{sign}(Y_i-m_i\hat{p_i})\sqrt{2[Y_i \log\left(\frac{Y_i}{m_i \hat{p_i}}\right)+ (m_i - Y_i) \log\left(\frac{m_i - Y_i}{m_i - m_i \hat{p_i}}\right)]} \end{equation*}\]
When the number of trials is large for all of the observations and the models are appropriate, both sets of residuals should follow a standard normal distribution.
The sum of the squared deviance residuals is referred to as the deviance or residual deviance. The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, \(m_i\), are large and a model fits, the residual deviance follows a \(\chi^2\) distribution with \(n-p\) degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meet these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a \(\chi_{n-p}^2\) distribution that are greater than the observed residual deviance.
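A brief sketch of how these quantities are extracted in R, assuming a fitted binomial-count model object `fit` (the name is illustrative):

```r
# Sketch: residuals and a deviance goodness-of-fit p-value for a binomial glm `fit`
head(residuals(fit, type = "pearson"))   # Pearson residuals
head(residuals(fit, type = "deviance"))  # deviance residuals
# lack-of-fit p-value: P(chi^2_{n-p} > residual deviance)
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```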
Application in R and STATA
Single variable descriptive statistics
quietly import delimited heart
label define sex 1 "male" 0 "female", replace
label value sex sex
label define fbs 1 "true" 0 "false", replace
label value fbs fbs
label define target 0 "No disease" 1 "Disease", replace
label value target target
label define thal 1 "normal" 2 "fixed defect" 3 "reversible defect", replace
label value thal thal
label define exang 1 "yes" 0 "no", replace
label value exang exang
quietly save heart , replace
quietly import delimited heart
dtable, continuous(age trestbps chol thalach oldpeak) factor(exang sex target thal fbs restecg cp ca slope) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------
Summary
-------------------------
N 1,025
age 54.434 (9.072)
trestbps 131.612 (17.517)
chol 246.000 (51.593)
thalach 149.114 (23.006)
oldpeak 1.072 (1.175)
exang
0 680 (66.3%)
1 345 (33.7%)
sex
0 312 (30.4%)
1 713 (69.6%)
target
0 499 (48.7%)
1 526 (51.3%)
thal
0 7 (0.7%)
1 64 (6.2%)
2 544 (53.1%)
3 410 (40.0%)
fbs
0 872 (85.1%)
1 153 (14.9%)
restecg
0 497 (48.5%)
1 513 (50.0%)
2 15 (1.5%)
cp
0 497 (48.5%)
1 167 (16.3%)
2 284 (27.7%)
3 77 (7.5%)
ca
0 578 (56.4%)
1 226 (22.0%)
2 134 (13.1%)
3 69 (6.7%)
4 18 (1.8%)
slope
0 74 (7.2%)
1 482 (47.0%)
2 469 (45.8%)
-------------------------
(collection DTable exported to file Statin_table.docx)
- The table above shows frequencies and proportions for the categorical variables.
- Continuous variables were summarised by means and standard deviations.
use heart
quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
quietly histogram trestbps ,title("Histogram of trestbps") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)
quietly histogram chol ,title("Histogram of chol") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)
quietly histogram thalach ,title("Histogram of thalach") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph4 , replace)
quietly histogram oldpeak ,title("Histogram of oldpeak") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph5 , replace)
graph pie, over(sex) name(graph6 , replace) title("distribution of sex")
graph pie, over(exang) name(graph7 , replace) title("distribution of exang")
graph hbar (count), over(cp) blabel(bar) name(graph8 , replace) title("distribution of cp")
graph hbar (count), over(ca) blabel(bar) name(graph9 , replace) title("distribution of ca")
graph hbar (count), over(fbs) blabel(bar) name(graph10 , replace) title("distribution of fbs")
graph hbar (count), over(thal) blabel(bar) name(graph11 , replace) title("distribution of thal")
graph hbar (count), over(restecg) blabel(bar) name(graph12 , replace) title("distribution of restecg")
graph hbar (count), over(slope) blabel(bar) name(graph13 , replace) title("distribution of slope")
graph pie, over(target) name(graph14 , replace) title("distribution of target")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 graph14 ,title("")
quietly graph export all_graphs.svg, replace
Summary statistics by Outcome Variable
quietly import delimited heart
dtable, by(target, tests testnotes) continuous(age trestbps chol thalach oldpeak) factor(exang sex thal fbs restecg cp ca slope) title(Table1) titlestyles( font(, bold) ) export("target", as(docx) replace)
note: using test regress across levels of target for age, trestbps, chol,
thalach, and oldpeak.
note: using test pearson across levels of target for exang, sex, thal, fbs,
restecg, cp, ca, and slope.
Table1
------------------------------------------------------------------
target
0 1 Total Test
------------------------------------------------------------------
N 499 (48.7%) 526 (51.3%) 1,025 (100.0%)
age 56.569 (7.908) 52.409 (9.632) 54.434 (9.072) <0.001
trestbps 134.106 (18.577) 129.245 (16.112) 131.612 (17.517) <0.001
chol 251.293 (49.559) 240.979 (53.010) 246.000 (51.593) 0.001
thalach 139.130 (22.565) 158.586 (19.097) 149.114 (23.006) <0.001
oldpeak 1.600 (1.291) 0.570 (0.771) 1.072 (1.175) <0.001
exang
0 225 (45.1%) 455 (86.5%) 680 (66.3%) <0.001
1 274 (54.9%) 71 (13.5%) 345 (33.7%)
sex
0 86 (17.2%) 226 (43.0%) 312 (30.4%) <0.001
1 413 (82.8%) 300 (57.0%) 713 (69.6%)
thal
0 4 (0.8%) 3 (0.6%) 7 (0.7%) <0.001
1 43 (8.6%) 21 (4.0%) 64 (6.2%)
2 132 (26.5%) 412 (78.3%) 544 (53.1%)
3 320 (64.1%) 90 (17.1%) 410 (40.0%)
fbs
0 417 (83.6%) 455 (86.5%) 872 (85.1%) 0.188
1 82 (16.4%) 71 (13.5%) 153 (14.9%)
restecg
0 283 (56.7%) 214 (40.7%) 497 (48.5%) <0.001
1 204 (40.9%) 309 (58.7%) 513 (50.0%)
2 12 (2.4%) 3 (0.6%) 15 (1.5%)
cp
0 375 (75.2%) 122 (23.2%) 497 (48.5%) <0.001
1 33 (6.6%) 134 (25.5%) 167 (16.3%)
2 65 (13.0%) 219 (41.6%) 284 (27.7%)
3 26 (5.2%) 51 (9.7%) 77 (7.5%)
ca
0 163 (32.7%) 415 (78.9%) 578 (56.4%) <0.001
1 160 (32.1%) 66 (12.5%) 226 (22.0%)
2 113 (22.6%) 21 (4.0%) 134 (13.1%)
3 60 (12.0%) 9 (1.7%) 69 (6.7%)
4 3 (0.6%) 15 (2.9%) 18 (1.8%)
slope
0 46 (9.2%) 28 (5.3%) 74 (7.2%) <0.001
1 324 (64.9%) 158 (30.0%) 482 (47.0%)
2 129 (25.9%) 340 (64.6%) 469 (45.8%)
------------------------------------------------------------------
(collection DTable exported to file target.docx)
- Preliminary results from Chi-square tests of association and Student's t-tests indicate that, at the \(5\%\) level of significance, most variables are associated with being diagnosed or not being diagnosed with the disease.
use heart
graph box age , over(target) title("distribution of Age by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)
graph box trestbps , over(target) title("distribution of trestbps by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph2 , replace)
graph box chol ,over(target) title("distribution of chol by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)
graph box thalach ,over(target) title("distribution of thalach by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph4 , replace)
graph box oldpeak ,over(target) title("distribution of oldpeak by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph5 , replace)
graph hbar ,over(target) over(sex) stack asyvars name(graph6 , replace) title("distribution of sex")
graph hbar ,over(target) over(exang) stack asyvars name(graph7 , replace) title("distribution of exang")
graph hbar ,over(target) over(cp) stack asyvars name(graph8 , replace) title("distribution of cp")
graph hbar ,over(target) over(ca) stack asyvars name(graph9 , replace) title("distribution of ca")
graph hbar,over(target) over(fbs) stack asyvars name(graph10 , replace) title("distribution of fbs")
graph hbar ,over(target) over(thal) stack asyvars name(graph11 , replace) title("distribution of thal")
graph hbar ,over(target) over(restecg) stack asyvars name(graph12 , replace) title("distribution of restecg")
graph hbar ,over(target) over(slope) stack asyvars name(graph13 , replace) title("distribution of slope")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 ,title("")
quietly graph export all_target.svg, replace
Fit a simple logistic regression model using age as the only risk factor of heart disease status
A logistic regression model is a generalized linear model and is made up of a link function and a linear component:
\[\underbrace{g(\mu_i)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}\]
quietly import delimited heart
logit target age
Iteration 0: Log likelihood = -710.12021
Iteration 1: Log likelihood = -682.58876
Iteration 2: Log likelihood = -682.53284
Iteration 3: Log likelihood = -682.53284
Logistic regression Number of obs = 1,025
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
Log likelihood = -682.53284 Pseudo R2 = 0.0388
------------------------------------------------------------------------------
target | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0530036 .0073802 -7.18 0.000 -.0674686 -.0385386
_cons | 2.943003 .4084775 7.20 0.000 2.142402 3.743605
------------------------------------------------------------------------------
mod<-glm(target~ age,data=new,family=binomial)
summary(mod)
Call:
glm(formula = target ~ age, family = binomial, data = new)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.94300 0.40848 7.205 0.000000000000581 ***
age -0.05300 0.00738 -7.182 0.000000000000688 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1420.2 on 1024 degrees of freedom
Residual deviance: 1365.1 on 1023 degrees of freedom
AIC: 1369.1
Number of Fisher Scoring iterations: 4
- the equation below shows a simple logistic regression model fit on age as the predictor.
\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})= 2.943003-.0530036Age \end{aligned}\]
(b) Report the estimated coefficient for age, and interpret its sign and meaning in the context of heart disease risk
- the model for the output above can be written as follows
\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 2.943003-.0530036Age \end{aligned}\]
- The estimated coefficient for age is \(-0.0530036\), which is negative: for a 1 year increase in age, the log odds of being diagnosed with heart disease decrease significantly by 0.0530036.
- Thus the chances of developing heart disease decrease with age.
Is Age significantly associated with heart disease?
- Age is indeed significantly associated with the probability of developing heart disease at the 5% level of significance, since the corresponding p-value (reported as 0.000) is less than 0.05.
- The coefficient's confidence interval is \([-0.0674686, -0.0385386]\) and does not contain zero, hence the variable is significant.
\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})&= 2.943003-.0530036Age\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003-.0530036Age}\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003}.e^{-.0530036Age} \end{aligned}\]
- the odds ratio associated with age is therefore \(e^{-.0530036}=0.9483766\), meaning that for a 1 year increase in age the odds of being diagnosed with heart disease decrease by \((1-0.9483766)\times 100\% \approx 5.16\%\)
Code below shows how to get the estimated odds ratios
quietly import delimited heart
logistic target age
Logistic regression Number of obs = 1,025
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
Log likelihood = -682.53284 Pseudo R2 = 0.0388
------------------------------------------------------------------------------
target | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .9483766 .0069993 -7.18 0.000 .9347571 .9621946
_cons | 18.97274 7.749939 7.20 0.000 8.51988 42.25001
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
- also note that we can find the probability of being diagnosed at a given age by the following probability formula.
\[ \begin{aligned} \hat{p}(x) &= \frac{\exp(\hat{\beta}_{0} + \hat{\beta}_{1}Age)}{1 + \exp(\hat{\beta}_{0} + \hat{\beta}_{1}Age)}\\ &= \frac{\exp(2.943003-.0530036Age)}{1 + \exp(2.943003-.0530036Age)} \end{aligned}\]
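For example, using the fitted R model `mod` from above, the predicted probability at age 50 (an illustrative value) can be obtained either with `predict()` or by hand:

```r
# Sketch: predicted probability of disease at age 50 from the fitted model `mod`
predict(mod, newdata = data.frame(age = 50), type = "response")
plogis(2.943003 - 0.0530036 * 50)  # same computation by hand
```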
likelihood ratio test
quietly import delimited heart
quietly logit target
est store model1
quietly logit target age
est store model2
lrtest model1 model2
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
mod0<-glm(target~ 1,data=new,family=binomial)
mod1<-glm(target~ age,data=new,family=binomial)
anova(mod0,mod1,test="Chisq")
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
1024 | 1420.240 | NA | NA | NA |
1023 | 1365.066 | 1 | 55.17475 | 0 |
- The likelihood ratio test above compares the null model with a model with age alone.
- the \(p\text{-value}<0.001\) indicates that the model with age is a better fit compared to a model with no covariates at all.
- thus our current model fits the data better than a model with no predictors.
Add sex variable
quietly import delimited heart
logit target age i.sex
Iteration 0: Log likelihood = -710.12021
Iteration 1: Log likelihood = -630.0552
Iteration 2: Log likelihood = -629.90565
Iteration 3: Log likelihood = -629.90562
Logistic regression Number of obs = 1,025
LR chi2(2) = 160.43
Prob > chi2 = 0.0000
Log likelihood = -629.90562 Pseudo R2 = 0.1130
------------------------------------------------------------------------------
target | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0674349 .0080345 -8.39 0.000 -.0831823 -.0516875
1.sex | -1.533812 .1585261 -9.68 0.000 -1.844518 -1.223107
_cons | 4.819817 .4868406 9.90 0.000 3.865627 5.774007
------------------------------------------------------------------------------
mod2<-glm(target~ age+sex,data=new,family=binomial)
summary(mod2)
Call:
glm(formula = target ~ age + sex, family = binomial, data = new)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819818 0.486841 9.900 <0.0000000000000002 ***
age -0.067435 0.008035 -8.393 <0.0000000000000002 ***
sex -1.533812 0.158526 -9.675 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1420.2 on 1024 degrees of freedom
Residual deviance: 1259.8 on 1022 degrees of freedom
AIC: 1265.8
Number of Fisher Scoring iterations: 4
- the model for the output above can be written as follows
\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 4.819817-.0674349Age-1.533812Male \end{aligned}\]
- The estimated coefficient for age is \(-0.0674349\), which is negative: for a 1 year increase in age, the log odds of being diagnosed with heart disease decrease significantly by 0.0674349, adjusting for the sex of the patient.
- Adjusting for age, males have log odds of being diagnosed with heart disease 1.533812 lower than females.
- Both variables are significant at the 5% level of significance since their p-values are less than 0.05.
Confidence Intervals for Model Coefficients
Since the Wald statistic follows a normal distribution with \(n\) large, we could generate a Wald-type (normal-based) confidence interval for \(\beta_2\) using:
\(\hat\beta_2 \pm 1.96\cdot\textrm{SE}(\hat\beta_2)\) and then exponentiating endpoints if we prefer a confidence interval for the odds ratio \(e^{\beta_2}\). In this case,
\[ \begin{align*} 95\% \textrm{ CI for } \beta_2 &= \hat{\beta}_2 \pm 1.96 \cdot \textrm{SE}(\hat{\beta}_2) \\ &= -1.533812 \pm 1.96 \cdot 0.158526 \\ &= -1.533812 \pm 0.310711 \\ &= (-1.844523, -1.223101) \\ 95\% \textrm{ CI for } e^{\beta_2} &= (e^{-1.844523}, e^{-1.223101}) \\ &= (0.1581007, 0.2943161) \\ \end{align*} \]
confint(mod2)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 3.88237336 5.79217853
age -0.08342613 -0.05190659
sex -1.84937968 -1.22746318
exp(confint(mod2))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 48.5392799 327.7262070
age 0.9199590 0.9494175
sex 0.1573347 0.2930350
Thus, we can be 95% confident that being male is associated with a 70.6% to 84.2% decrease in the odds of heart disease, after controlling for age.
Multinomial logistic Regression
If we have more than two categories or groups that we want to model relative to covariates (e.g., we have observations \(i = 1,\dots,n\) and groups \(j = 1,2,\dots,J\)), the multinomial model is our candidate
Let
- \(p_{ij}\) be the probability that the i-th observation belongs to the j-th group
- \(Y_{ij}\) be the number of observations for individual i in group j; An individual will have observations \(Y_{i1},Y_{i2},…Y_{iJ}\)
- assume the probability of observing this response is given by a multinomial distribution in terms of probabilities \(p_{ij}\), where \(\sum_{j = 1}^J p_{ij} = 1\) . For interpretation, we have a baseline category \(p_{i1} = 1 - \sum_{j = 2}^J p_{ij}\)
The link between the mean response (probability) \(p_{ij}\) and a linear function of the covariates
\[ \eta_{ij} = \mathbf{x'_i \beta_j} = \log \frac{p_{ij}}{p_{i1}}, \quad j = 2,\dots,J \]
We compare \(p_{ij}\) to the baseline \(p_{i1}\), suggesting
\[ p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{l=2}^J \exp(\eta_{il})} \]
which is known as multinomial logistic model.
Note:
- Softmax coding for multinomial logistic regression: rather than selecting a baseline class, we treat all \(K\) classes symmetrically, as equally important (no baseline).
\[ P(Y = k | X = x) = \frac{\exp(\beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p)}{\sum_{l = 1}^K \exp(\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p)} \]
then the log odds ratio between k-th and k’-th classes is
\[ \log (\frac{P(Y=k|X=x)}{P(Y = k' | X=x)}) = (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1}) x_1 + \dots + (\beta_{kp} - \beta_{k'p}) x_p \]
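A tiny numerical sketch of the softmax calculation (made-up linear predictor values for \(K=3\) classes):

```r
# Sketch: softmax probabilities for K = 3 classes at one observation
eta <- c(0.2, -0.5, 1.1)        # made-up linear predictors, one per class
p <- exp(eta) / sum(exp(eta))   # softmax; sums to 1
p
log(p[1] / p[2])                # log odds ratio = eta[1] - eta[2] = 0.7
```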
About the dataset
You are an analyst at a large technology company. The company recently introduced a new health insurance provider for its employees. At the beginning of the year the employees had to choose one of three different health plan products from this provider to best suit their needs. You have been asked to determine which factors influenced the choice in product.
The `health_insurance` data set consists of the following fields:
- `product`: The choice of product of the individual—A, B or C
- `age`: The age of the individual when they made the choice
- `gender`: The gender of the individual as stated when they made the choice
- `household`: The number of people living with the individual in the same household at the time of the choice
- `position_level`: Position level in the company at the time they made the choice, where 1 is the lowest and 5 is the highest
- `absent`: The number of days the individual was absent from work in the year prior to the choice
Descriptive statistics for single variables
quietly import delimited health_insurance
egen product2 = group(product), label
egen position_level2 = group(position_level), label
egen gender2 = group(gender), label
drop product
drop position_level
drop gender
rename product2 product
rename gender2 gender
rename position_level2 position_level
quietly save health_insurance , replace
dtable, continuous(age absent household) factor(gender product position_level) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------------------
Summary
-------------------------------------
N 1,453
age 40.919 (13.530)
absent 14.467 (8.114)
household 3.257 (2.231)
group(gender)
Female 889 (61.2%)
Male 559 (38.5%)
Non-binary 5 (0.3%)
group(product)
A 495 (34.1%)
B 459 (31.6%)
C 499 (34.3%)
group(position_level)
1 184 (12.7%)
2 404 (27.8%)
3 436 (30.0%)
4 231 (15.9%)
5 198 (13.6%)
-------------------------------------
(collection DTable exported to file Statin_table.docx)
Descriptive plots for single variables
use health_insurance
quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
quietly histogram absent ,title("Histogram of absentism") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)
quietly histogram household ,title("Histogram of number of households") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)
graph pie, over(position_level) name(graph4 , replace) title("distribution of Position level")
graph hbar (count), over(gender) blabel(bar) name(graph5 , replace) title("distribution of Gender")
graph hbar (count), over(product) blabel(bar) name(graph6 , replace) title("distribution of Products")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 ,title("")
quietly graph export all_multi.svg, replace
Descriptive statistics by outcome variable
use health_insurance
dtable, by(product, tests testnotes) continuous(age absent household) factor(gender position_level) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
note: using test regress across levels of product for age, absent, and
household.
note: using test pearson across levels of product for gender and
position_level.
Table1
-------------------------------------------------------------------------------------------
group(product)
A B C Total Test
-------------------------------------------------------------------------------------------
N 495 (34.1%) 459 (31.6%) 499 (34.3%) 1,453 (100.0%)
age 28.913 (5.158) 44.865 (13.756) 49.198 (10.346) 40.919 (13.530) <0.001
absent 14.451 (8.030) 14.394 (8.092) 14.549 (8.232) 14.467 (8.114) 0.956
household 3.642 (2.339) 1.462 (1.201) 4.527 (1.740) 3.257 (2.231) <0.001
group(gender)
Female 244 (49.3%) 404 (88.0%) 241 (48.3%) 889 (61.2%) <0.001
Male 249 (50.3%) 53 (11.5%) 257 (51.5%) 559 (38.5%)
Non-binary 2 (0.4%) 2 (0.4%) 1 (0.2%) 5 (0.3%)
group(position_level)
1 66 (13.3%) 59 (12.9%) 59 (11.8%) 184 (12.7%) 0.848
2 136 (27.5%) 133 (29.0%) 135 (27.1%) 404 (27.8%)
3 144 (29.1%) 146 (31.8%) 146 (29.3%) 436 (30.0%)
4 81 (16.4%) 65 (14.2%) 85 (17.0%) 231 (15.9%)
5 68 (13.7%) 56 (12.2%) 74 (14.8%) 198 (13.6%)
-------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)
Plots by Outcome variable
use health_insurance
graph box age ,over(product) title("boxplot of Age by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)
graph box absent ,over(product) title("boxplot of absentism by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph2 , replace)
graph box household ,over(product) title("boxplot of households by product ") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)
graph hbar,over(product) over(position_level) stack asyvars name(graph4 , replace) title("distribution of Position level")
graph hbar ,over(product) over(gender) stack asyvars name(graph5 , replace) title("distribution of Gender")
graph combine graph1 graph2 graph3 graph4 graph5 ,title("")
quietly graph export all_multii.svg, replace
Assess the nature of the product variable.
use health_insurance
tab product
group(produ |
ct) | Freq. Percent Cum.
------------+-----------------------------------
A | 495 34.07 34.07
B | 459 31.59 65.66
C | 499 34.34 100.00
------------+-----------------------------------
Total | 1,453 100.00
- The product variable has 3 levels which represent 3 product types (\(A, B ~and~ C\))
- Thus the variable is of nominal type because it represents distinct product types without an inherent order (i.e., A is not objectively less than B, and B is not less than C).
Model building
The most suitable model for this analysis is multinomial logistic regression.
Log Odds and Odds Ratios
- The logit functions (log odds) when our outcome is the Product variable with categories (A, B and C) and the independent variable is age are shown below
- the log odds for comparison between B and A is
\[g_1(x) = ln\frac{P(product=B|age)} {P(product=A|age)}=\alpha_1 + (\beta_{11}age)\]
- and, the log odds for comparison between C and A is
\[g_2(x) = ln\frac{P(Product=C|age)} {P(Product=A|age)}=\alpha_2 + (\beta_{21}age)\]
- Here the base category is product A, hence all inferences are done with reference to it
use health_insurance
mlogit product age , baseoutcome(1)
mlogit product age , baseoutcome(1) rrr
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -1230.0067
Iteration 2: Log likelihood = -1175.2485
Iteration 3: Log likelihood = -1171.0613
Iteration 4: Log likelihood = -1171.0379
Iteration 5: Log likelihood = -1171.0379
Multinomial logistic regression Number of obs = 1,453
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Log likelihood = -1171.0379 Pseudo R2 = 0.2659
------------------------------------------------------------------------------
product | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | .1969927 .0122123 16.13 0.000 .173057 .2209283
_cons | -6.974368 .4166712 -16.74 0.000 -7.791029 -6.157708
-------------+----------------------------------------------------------------
C |
age | .2274424 .0125945 18.06 0.000 .2027576 .2521272
_cons | -8.323104 .4420157 -18.83 0.000 -9.189439 -7.456769
------------------------------------------------------------------------------
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -1230.0067
Iteration 2: Log likelihood = -1175.2485
Iteration 3: Log likelihood = -1171.0613
Iteration 4: Log likelihood = -1171.0379
Iteration 5: Log likelihood = -1171.0379
Multinomial logistic regression Number of obs = 1,453
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Log likelihood = -1171.0379 Pseudo R2 = 0.2659
------------------------------------------------------------------------------
product | RRR Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | 1.217735 .0148713 16.13 0.000 1.188934 1.247234
_cons | .0009356 .0003898 -16.74 0.000 .0004134 .0021171
-------------+----------------------------------------------------------------
C |
age | 1.255385 .015811 18.06 0.000 1.224776 1.28676
_cons | .0002428 .0001073 -18.83 0.000 .0001021 .0005775
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
model<-nnet::multinom(product~age, data=my_data)
# weights: 9 (4 variable)
initial value 1596.283655
iter 10 value 1171.049631
final value 1171.037911
converged
tab_model(model)
group(product)

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 | B |
| age | 1.22 | 1.19 – 1.25 | <0.001 | B |
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 | C |
| age | 1.26 | 1.22 – 1.29 | <0.001 | C |
| Observations | 1453 | | | |
| R2 / R2 adjusted | 0.266 / 0.265 | | | |
Thus, for the simple model, the Stata output gives us two logit models:
\[ \begin{aligned} g_1(x) &= ln\frac{P(Product=B|age)} {P(Product=A|age)}=-6.974368 + (0.1969927age)\\ RRR_1&=\frac{P(Product=B|age)} {P(Product=A|age)}=e^{-6.974368 + (0.1969927age)} \end{aligned}\]
Interpretation of coefficients
- For a one-unit change in the variable age, the multinomial log-odds for choosing Product B relative to Product A increase by 0.1969927; age is statistically significant at the 5% level of significance.
Interpretation of exponentiated coefficients
- For a one-unit change in the variable age, the relative risk ratio for choosing Product B relative to Product A increases by a factor of \(e^{0.1969927}=1.21773\); age is statistically significant at the 5% level of significance.
\[ \begin{aligned} g_2(x) &= ln\frac{P(Product=C|age)} {P(Product=A|age)}=-8.323104 + (0.2274424age)\\ RRR_2&= \frac{P(Product=C|age)} {P(Product=A|age)}=e^{-8.323104 + (0.2274424age)} \end{aligned} \]
- For a one-unit change in the variable age, the multinomial log-odds for choosing Product C relative to Product A increase by 0.2274424; age is statistically significant at the 5% level of significance.
Interpretation of exponentiated coefficients
- For a one-unit change in the variable age, the relative risk ratio for choosing Product C relative to Product A increases by a factor of \(e^{0.2274424}=1.255385\); age is statistically significant at the 5% level of significance.
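In R, the same log-RRRs and RRRs can be read off the fitted `multinom` object `model` from above (a sketch):

```r
# Sketch: log-RRRs and RRRs from the fitted multinom object `model`
coef(model)       # one row per non-baseline outcome (B and C vs A)
exp(coef(model))  # RRRs, e.g. exp(0.1969927) = 1.2177 for age (B vs A)
```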
model2<-nnet::multinom(product~age+absent+household+gender+position_level, data=my_data)
# weights: 33 (20 variable)
initial value 1596.283655
iter 10 value 896.964516
iter 20 value 754.444237
iter 30 value 743.380590
final value 743.380575
converged
tab_model(model2)
group(product)

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 0.01 | 0.00 – 0.02 | <0.001 | B |
| age | 1.28 | 1.24 – 1.32 | <0.001 | B |
| absent | 1.01 | 0.99 – 1.04 | 0.297 | B |
| household | 0.38 | 0.33 – 0.44 | <0.001 | B |
| group(gender): Male | 0.09 | 0.06 – 0.15 | <0.001 | B |
| group(gender): Non-binary | 1.27 | 0.12 – 13.77 | 0.843 | B |
| group(position level): position level 2 | 0.82 | 0.44 – 1.50 | 0.515 | B |
| group(position level): position level 3 | 0.51 | 0.27 – 0.93 | 0.028 | B |
| group(position level): position level 4 | 0.22 | 0.10 – 0.49 | <0.001 | B |
| group(position level): position level 5 | 0.24 | 0.11 – 0.54 | <0.001 | B |
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 | C |
| age | 1.31 | 1.27 – 1.35 | <0.001 | C |
| absent | 1.00 | 0.98 – 1.03 | 0.717 | C |
| household | 1.23 | 1.12 – 1.36 | <0.001 | C |
| group(gender): Male | 1.11 | 0.76 – 1.64 | 0.581 | C |
| group(gender): Non-binary | 0.27 | 0.01 – 14.42 | 0.521 | C |
| group(position level): position level 2 | 0.87 | 0.48 – 1.58 | 0.640 | C |
| group(position level): position level 3 | 0.73 | 0.40 – 1.33 | 0.305 | C |
| group(position level): position level 4 | 0.44 | 0.22 – 0.88 | 0.021 | C |
| group(position level): position level 5 | 0.48 | 0.23 – 1.01 | 0.055 | C |
| Observations | 1453 | | | |
| R2 / R2 adjusted | 0.534 / 0.533 | | | |
use health_insurance
ssc install fitstat
mlogit product age absent household i.gender i.position_level , baseoutcome(1)
mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
checking fitstat consistency and verifying not already installed...
all files already exist and are up to date.
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -774.56045
Iteration 2: Log likelihood = -745.09229
Iteration 3: Log likelihood = -743.38342
Iteration 4: Log likelihood = -743.38057
Iteration 5: Log likelihood = -743.38057
Multinomial logistic regression Number of obs = 1,453
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
Log likelihood = -743.38057 Pseudo R2 = 0.5340
------------------------------------------------------------------------------
product | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | .2449861 .0154906 15.82 0.000 .2146251 .2753472
absent | .0137522 .0131766 1.04 0.297 -.0120734 .0395778
household | -.9662861 .0698785 -13.83 0.000 -1.103245 -.8293269
|
gender |
Male | -2.369476 .2334001 -10.15 0.000 -2.826932 -1.91202
Non-binary | .240953 1.214742 0.20 0.843 -2.139897 2.621803
|
position_l~l |
2 | -.2033913 .3119391 -0.65 0.514 -.8147807 .4079982
3 | -.6830315 .3107755 -2.20 0.028 -1.29214 -.0739228
4 | -1.510145 .4024609 -3.75 0.000 -2.298954 -.7213359
5 | -1.407559 .4024606 -3.50 0.000 -2.196368 -.6187512
|
_cons | -5.207851 .5493343 -9.48 0.000 -6.284527 -4.131176
-------------+----------------------------------------------------------------
C |
age | .2706643 .0157149 17.22 0.000 .2398636 .3014651
absent | .0046137 .0127058 0.36 0.717 -.0202892 .0295166
household | .2077009 .0497547 4.17 0.000 .1101834 .3052184
|
gender |
Male | .1085684 .1964055 0.55 0.580 -.2763794 .4935162
Non-binary | -1.293302 2.021219 -0.64 0.522 -5.254818 2.668215
|
position_l~l |
2 | -.1427916 .3048914 -0.47 0.640 -.7403678 .4547845
3 | -.3147303 .306651 -1.03 0.305 -.9157553 .2862946
4 | -.8247833 .3575154 -2.31 0.021 -1.525501 -.124066
5 | -.730075 .3794344 -1.92 0.054 -1.473753 .0136028
|
_cons | -10.54573 .6452422 -16.34 0.000 -11.81038 -9.281076
------------------------------------------------------------------------------
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -774.56045
Iteration 2: Log likelihood = -745.09229
Iteration 3: Log likelihood = -743.38342
Iteration 4: Log likelihood = -743.38057
Iteration 5: Log likelihood = -743.38057
Multinomial logistic regression Number of obs = 1,453
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
Log likelihood = -743.38057 Pseudo R2 = 0.5340
------------------------------------------------------------------------------
product | RRR Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | 1.277604 .0197908 15.82 0.000 1.239397 1.316988
absent | 1.013847 .013359 1.04 0.297 .9879992 1.040371
household | .3804935 .0265883 -13.83 0.000 .3317925 .4363429
|
gender |
Male | .0935297 .0218298 -10.15 0.000 .0591942 .1477815
Non-binary | 1.272461 1.545712 0.20 0.843 .1176669 13.76052
|
position_l~l |
2 | .8159589 .2545295 -0.65 0.514 .4427364 1.503804
3 | .5050835 .1569676 -2.20 0.028 .2746823 .9287434
4 | .220878 .0888948 -3.75 0.000 .1003638 .4861024
5 | .2447399 .0984981 -3.50 0.000 .1112064 .5386167
|
_cons | .0054734 .0030067 -9.48 0.000 .0018649 .016064
-------------+----------------------------------------------------------------
C |
age | 1.310835 .0205997 17.22 0.000 1.271076 1.351838
absent | 1.004624 .0127646 0.36 0.717 .9799153 1.029957
household | 1.230845 .0612404 4.17 0.000 1.116483 1.356921
|
gender |
Male | 1.114681 .2189296 0.55 0.580 .7585251 1.638066
Non-binary | .2743634 .5545485 -0.64 0.522 .0052223 14.41421
|
position_l~l |
2 | .8669347 .2643209 -0.47 0.640 .4769385 1.575834
3 | .7299857 .2238508 -1.03 0.305 .4002142 1.331485
4 | .43833 .1567097 -2.31 0.021 .2175122 .8833216
5 | .4818729 .1828392 -1.92 0.054 .2290642 1.013696
|
_cons | .0000263 .000017 -16.34 0.000 7.43e-06 .0000932
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
Product B relative to Product A
\[ \begin{aligned} \ln\frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} &= -5.207851 + 0.2449861\,Age + 0.0137522\,Absent - 0.9662861\,Household \\ &\quad - 2.369476\,Male + 0.240953\,Nonbinary - 0.2033913\,Position_2 \\ &\quad - 0.6830315\,Position_3 - 1.510145\,Position_4 - 1.407559\,Position_5 \\ \frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} &= e^{-5.207851 + 0.2449861 Age + 0.0137522 Absent - 0.9662861 Household - 2.369476 Male + 0.240953 Nonbinary - 0.2033913 Position_2 - 0.6830315 Position_3 - 1.510145 Position_4 - 1.407559 Position_5} \end{aligned}\]
- For a one-unit change in the variable age, the relative risk ratio for choosing Product B relative to Product A increases significantly by a factor of \(1.277604\), adjusting for the other covariates; age is statistically significant \((p<0.05)\).
- For a 1-day increase in absenteeism from the previous year, the relative risk ratio for choosing Product B relative to Product A increases by a factor of \(1.013847\), adjusting for the other variables; however, absenteeism is not statistically significant \((p>0.05)\).
- If one person is added to the number of individuals in the employee's household, the relative risk ratio of choosing Product B relative to Product A is multiplied by \(0.3804935\) (a decrease), adjusting for the other covariates in the model; household size is statistically significant at the 5% level of significance.
- The relative risk ratio for choosing Product B relative to Product A decreases by \((1-0.0935297)\approx 91\%\) for males as compared to females, adjusting for all other covariates; gender is a statistically significant predictor of the choice of insurance product \((p<0.05)\).
- The relative risk ratio for choosing Product B relative to Product A decreases at position levels 2, 3, 4 and 5 relative to position 1, holding all other covariates constant. In essence this means:
- decrease in RRR for Position 2 vs Position 1 -> \(0.1840411\)
- decrease in RRR for Position 3 vs Position 1 -> \(0.4949165^{***}\)
- decrease in RRR for Position 4 vs Position 1 -> \(0.779122^{***}\)
- decrease in RRR for Position 5 vs Position 1 -> \(0.7552601^{***}\)
here we note that the starred comparisons are the only ones significant at the 5% level of significance.
Product C relative to Product A
\[ \begin{aligned} \ln\frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} &= -10.54573 + 0.2706643\,Age + 0.0046137\,Absent + 0.2077009\,Household \\ &\quad + 0.1085684\,Male - 1.293302\,Nonbinary - 0.1427916\,Position_2 \\ &\quad - 0.3147303\,Position_3 - 0.8247833\,Position_4 - 0.730075\,Position_5 \end{aligned}\]
which equates to the RRR as follows:
\[ \frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} = e^{-10.54573 + 0.2706643 Age + 0.0046137 Absent + 0.2077009 Household + 0.1085684 Male - 1.293302 Nonbinary - 0.1427916 Position_2 - 0.3147303 Position_3 - 0.8247833 Position_4 - 0.730075 Position_5} \]
- For a one-unit change in the variable age, the relative risk ratio for choosing Product C relative to Product A increases significantly by a factor of 1.310835, adjusting for the other covariates; age is statistically significant.
- For a 1-day increase in absenteeism from the previous year, the relative risk ratio for choosing Product C relative to Product A increases by a factor of 1.004624, adjusting for the other variables; however, absenteeism is not statistically significant.
- If one person is added to the number of individuals in the employee's household, the relative risk ratio of choosing Product C relative to Product A increases by a factor of 1.230845, adjusting for the other covariates in the model; household size is statistically significant at the 5% level of significance.
- The relative risk ratio for choosing Product C relative to Product A increases by about 11.5% \((1.114681-1)\) for males as compared to females, adjusting for all other covariates; however, gender is not a statistically significant predictor of the choice of insurance product here.
- The relative risk ratio for choosing Product C relative to Product A decreases at position levels 2, 3, 4 and 5 relative to position 1, holding all other covariates constant. In essence this means:
- decrease in RRR for Position 2 vs Position 1 -> \(0.1330653\)
- decrease in RRR for Position 3 vs Position 1 -> \(0.2700143\)
- decrease in RRR for Position 4 vs Position 1 -> \(0.56167^{***}\)
- decrease in RRR for Position 5 vs Position 1 -> \(0.5181271\)
here we note that the starred comparison is the only one significant at the 5% level of significance.
Evaluate and compare the overall fit
Null model vs Age model
use health_insurance
quietly mlogit product, baseoutcome(1) rrr
est store model1
quietly mlogit product age , baseoutcome(1) rrr
est store model2
**Likelihood ratio test**
estout model1 model2 , stats(bic aic)
lrtest model1 model2
model1 model2
b b
--------------------------------------
A
age 0
_cons 0 0
--------------------------------------
B
age .1969927
_cons -.0755076 -6.974368
--------------------------------------
C
age .2274424
_cons .0080483 -8.323104
--------------------------------------
bic 3205.108 2371.201
aic 3194.546 2350.076
--------------------------------------
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
- For the model with age, AIC and BIC decreased substantially, indicating that the model with age performs better than the null model.
- Through the likelihood ratio test we see that the model with age performs better than the null model (the model with no covariates) at the 5% level of significance (\(p<0.001\))
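The same comparison can be done in R with nested `multinom` fits; a sketch, where `model0` is a hypothetical null fit and `model` is the age-only fit from above:

```r
# Sketch: likelihood ratio test for nested multinom models in R
model0 <- nnet::multinom(product ~ 1, data = my_data)  # null model
anova(model0, model, test = "Chisq")                   # LR test, df = 2
```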
Full model vs null model
use health_insurance
quietly mlogit product , baseoutcome(1) rrr
est store model1
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
est store model2
**Likelihood ratio test**
estout model1 model2 , stats(bic aic)
lrtest model1 model2
model1 model2
b b
--------------------------------------
A
age 0
absent 0
household 0
1.gender 0
2.gender 0
3.gender 0
1.position~l 0
2.position~l 0
3.position~l 0
4.position~l 0
5.position~l 0
_cons 0 0
--------------------------------------
B
age .2449861
absent .0137522
household -.9662861
1.gender 0
2.gender -2.369476
3.gender .240953
1.position~l 0
2.position~l -.2033913
3.position~l -.6830315
4.position~l -1.510145
5.position~l -1.407559
_cons -.0755076 -5.207851
--------------------------------------
C
age .2706643
absent .0046137
household .2077009
1.gender 0
2.gender .1085684
3.gender -1.293302
1.position~l 0
2.position~l -.1427916
3.position~l -.3147303
4.position~l -.8247833
5.position~l -.730075
_cons .0080483 -10.54573
--------------------------------------
bic 3205.108 1632.389
aic 3194.546 1526.761
--------------------------------------
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
- For the model with all variables, AIC and BIC decreased substantially, indicating that the full model performs better than the null model.
- Through the likelihood ratio test we see that the full model performs better than the null model (the model with no covariates) at the 5% level of significance (\(p<0.001\))
Justify your selection
use health_insurance
quietly mlogit product age, baseoutcome(1) rrr
est store model1
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
est store model2
**Likelihood ratio test**
lrtest model1 model2
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(16) = 855.31
Prob > chi2 = 0.0000
- The results of the likelihood ratio test can be used to ascertain the significance of predictors in the model. It tells us that the addition of more variables to age had significant main effects on product choice,
\[\chi^2_{(16)} = 855.31, p = .000\]
use health_insurance
*Measures for model with age alone*
quietly mlogit product age, baseoutcome(1) rrr
fitstat , save
*Model with additional variables*
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
fitstat, dif
Measures of Fit for mlogit of product
Log-Lik Intercept Only: -1595.273 Log-Lik Full Model: -1171.038
D(1447): 2342.076 LR(2): 848.470
Prob > LR: 0.000
McFadden's R2: 0.266 McFadden's Adj R2: 0.262
Maximum Likelihood R2: 0.442 Cragg & Uhler's R2: 0.498
Count R2: 0.509 Adj Count R2: 0.255
AIC: 1.620 AIC*n: 2354.076
BIC: -8194.089 BIC': -833.907
(Indices saved in matrix fs_0)
Measures of Fit for mlogit of product
Log-Lik Intercept Only: -1595.273 Log-Lik Full Model: -743.381
D(1417): 1486.761 LR(18): 1703.784
Prob > LR: 0.000
McFadden's R2: 0.534 McFadden's Adj R2: 0.511
Maximum Likelihood R2: 0.690 Cragg & Uhler's R2: 0.777
Count R2: 0.566 Adj Count R2: 0.341
AIC: 1.073 AIC*n: 1558.761
BIC: -8830.962 BIC': -1572.719
- The two fitstat outputs above show results for the model with the single predictor age and for the full model, respectively.
- We see that on fitting the full model the following significant changes which show model improvement were noted:
- log likelihood increased from \(-1171.038~ to ~-743.381\)
- Akaike information criterion decreased from \(2354.076 ~to~ 1558.761\)
- Bayesian Information criterion decreased from \(-833.907 ~to~ -1572.719\)
All these results suggest that adding more variables to the model improved model fit. Hence the model with all variables performs best, better than the model with age alone.
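The corresponding information-criterion comparison in R is a one-liner on the two fitted `multinom` objects (a sketch using `model` and `model2` from above):

```r
# Sketch: AIC/BIC comparison of the two multinom fits
AIC(model, model2)
BIC(model, model2)
```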