age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
58 | 0 | 0 | 100 | 248 | 0 | 0 | 122 | 0 | 1.0 | 1 | 0 | 2 | 1 |
Generalized Linear Modeling
GLM theory suggests that the canonical link can be modeled as a linear combination of the explanatory variable(s). This approach unifies a number of modeling results used in statistics.
By so doing we have now generalized our modeling to handle non-normal responses. In addition to normally distributed responses, we are able to handle Poisson responses, binomial responses, and more. Writing a pmf or pdf for a response in one-parameter exponential family form reveals the canonical link which can be modeled as a linear function of the predictors.
Exponential Family
The theory of GLMs is developed for data whose distribution belongs to the exponential family.
The form of the data distribution that is useful for GLMs is
\[ f(y;\theta, \phi) = \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y, \phi)) \]
where
- \(\theta\) is called the natural parameter
- \(\phi\) is called the dispersion parameter
Example
If we have \(Y \sim N(\mu, \sigma^2)\), then
\[ \begin{aligned} f(y; \mu, \sigma^2) &= \frac{1}{(2\pi \sigma^2)^{1/2}}\exp(-\frac{1}{2\sigma^2}(y- \mu)^2) \\ &= \exp(-\frac{1}{2\sigma^2}(y^2 - 2y \mu +\mu^2)- \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{y \mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y , \phi)) \end{aligned} \]
where
- \(\theta = \mu\)
- \(b(\theta) = \frac{\mu^2}{2}\)
- \(a(\phi) = \sigma^2 = \phi\)
- \(c(y , \phi) = - \frac{1}{2}(\frac{y^2}{\phi}+\log(2\pi \sigma^2))\)
Properties of GLM exponential families
\(E(Y) = b' (\theta)\) where \(b'(\theta) = \frac{\partial b(\theta)}{\partial \theta}\)
\(var(Y) = a(\phi)b''(\theta)= a(\phi)V(\mu)\).
- \(V(\mu)\) is the variance function; however, it is only the variance in the case that \(a(\phi) =1\)
Once \(a(\cdot), b(\cdot), c(\cdot)\) are identified, we can derive the expected value and variance of \(Y\).
Examples
Normal distribution
\[ \begin{aligned} b'(\theta) &= \frac{\partial (\theta^2/2)}{\partial \theta} = \theta = \mu \\ V(\mu) = b''(\theta) &= \frac{\partial^2 (\theta^2/2)}{\partial \theta^2} = 1 \\ \to var(Y) &= a(\phi)V(\mu) = \sigma^2 \end{aligned} \]
Poisson distribution
\[ \begin{aligned} f(y; \theta, \phi) &= \frac{\mu^y \exp(-\mu)}{y!} \\ &= \exp(y\log(\mu) - \mu - \log(y!)) \\ &= \exp(y\theta - \exp(\theta) - \log(y!)) \end{aligned} \]
where
- \(\theta = \log(\mu)\)
- \(a(\phi) = 1\)
- \(b(\theta) = \exp(\theta)\)
- \(c(y, \phi) = -\log(y!)\)
Hence,
\[ \begin{aligned} E(Y) &= \frac{\partial b(\theta)}{\partial \theta} = \exp(\theta) = \mu \\ var(Y) &= \frac{\partial^2 b(\theta)}{\partial \theta^2} = \mu \end{aligned} \]
Since \(\mu = E(Y) = b'(\theta)\), in a GLM we take some monotone function (typically nonlinear) of \(\mu\) to be linear in the set of covariates
\[ g(\mu) = g(b'(\theta)) = \mathbf{x'\beta} \]
Equivalently,
\[ \mu = g^{-1}(\mathbf{x'\beta}) \]
where \(g(\cdot)\) is called the link function, since it links the mean response (\(\mu = E(Y)\)) to a linear expression of the covariates
Some people write \(\eta = \mathbf{x'\beta}\), where \(\eta\) is called the "linear predictor"
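To make the link concrete, here is a minimal R sketch (simulated data; all names are illustrative) fitting a Poisson GLM with its canonical log link, so that \(g(\mu) = \log(\mu)\) is linear in the predictor:

```r
# Minimal sketch (simulated data): the canonical log link for a Poisson response
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x))  # true mean: mu = exp(eta)
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)  # estimates of the true (0.5, 1.2) on the log scale
```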
A GLM is composed of 2 components
The random component:
is the distribution chosen to model the response variables \(Y_1,...,Y_n\)
is specified by the choice of \(a(\cdot), b(\cdot), c(\cdot)\) in the exponential form
Notation:
- Assume that there are n independent response variables \(Y_1,...,Y_n\) with densities
\[ f(y_i ; \theta_i, \phi) = \exp(\frac{\theta_i y_i - b(\theta_i)}{a(\phi)}+ c(y_i, \phi)) \]
notice that each observation may have a different density
- Assume that \(\phi\) is constant for all \(i = 1,...,n\), but \(\theta_i\) will vary. \(\mu_i = E(Y_i)\) for all \(i\).
The systematic component
is the portion of the model that gives the relation between \(\mu\) and the covariates \(\mathbf{x}\)
consists of 2 parts:
- the link function, \(g(.)\)
- the linear predictor, \(\eta = \mathbf{x'\beta}\)
Notation:
- assume \(g(\mu_i) = \mathbf{x_i'\beta} = \eta_i\) where \(\mathbf{\beta} = (\beta_1,..., \beta_p)'\)
- The parameters to be estimated are \(\beta_1,...\beta_p , \phi\)
The Canonical Link
To choose \(g(.)\), we can use canonical link function (Remember: Canonical link is just a special case of the link function)
If the link function \(g(\cdot)\) is such that \(g(\mu_i) = \eta_i = \theta_i\), the natural parameter, then \(g(\cdot)\) is the canonical link.
The inverse link
\(g^{-1}(\cdot)\), also known as the mean function, takes the linear predictor output (ranging from \(-\infty\) to \(\infty\)) and transforms it onto a different scale.
Exponential: converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and \(\infty\) (which you can see is useful when you want to convert a linear predictor into a non-negative value): \(\lambda = \exp(y) = \exp(\mathbf{\beta X})\)
Inverse Logit (also known as logistic): converts \(\mathbf{\beta X}\) into a curve that is restricted between 0 and 1, which is useful in case you want to convert a linear predictor to a probability. \(\theta = \frac{1}{1 + \exp(-y)} = \frac{1}{1 + \exp(- \mathbf{\beta X})}\)
- \(y\) = linear predictor value
- \(\theta\) = transformed value
The identity link is
\[ \begin{aligned} \eta_i &= g(\mu_i) = \mu_i \\ \mu_i &= g^{-1}(\eta_i) = \eta_i\\ \end{aligned} \]
Examples
Normal random component
Mean Response: \(\mu_i = \theta_i\)
Canonical Link: \(g( \mu_i) = \mu_i\) (the identity link)
Binomial random component
Mean Response: \(\mu_i = \frac{n_i \exp( \theta_i)}{1+\exp (\theta_i)}\) and \(\theta(\mu_i) = \log(\frac{p_i }{1-p_i}) = \log (\frac{\mu_i} {n_i - \mu_i})\)
Canonical link: \(g(\mu_i) = \log(\frac{\mu_i} {n_i - \mu_i})\) (logit link)
Poisson random component
Mean Response: \(\mu_i = \exp(\theta_i)\)
Canonical Link: \(g(\mu_i) = \log(\mu_i)\)
Gamma random component:
Mean response: \(\mu_i = -\frac{1}{\theta_i}\) and \(\theta(\mu_i) = - \mu_i^{-1}\)
Canonical Link: \(g(\mu_i) = - \frac{1}{\mu_i}\)
Inverse Gaussian random component
- Canonical Link: \(g(\mu_i) = \frac{1}{\mu_i^2}\)
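As a quick check on the list above, base R's family objects report these canonical links directly (a small sketch using only base R):

```r
# Canonical links as reported by base R family objects
gaussian()$link          # "identity"
binomial()$link          # "logit"
poisson()$link           # "log"
Gamma()$link             # "inverse"
inverse.gaussian()$link  # "1/mu^2"
```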
Logistic Regression
If \(X\) follows a binomial distribution with \(n\) trials and probability of success \(p\), we can write:
\[\begin{align*} P(X=x)&= \binom{n}{x}p^x(1-p)^{(n-x)} \\ &=e^{x\log(p) + (n-x)\log(1-p) + \log\binom{n}{x}} \end{align*}\] However, this probability mass function is not quite in one-parameter exponential family form. Note that there are two terms in the exponent which consist of a product of functions of \(x\) and \(p\). So more simplification is in order:
\[\begin{equation*} P(X=x) = e^{x\log\left(\frac{p}{1-p}\right) + n\log(1-p)+ \log\binom{n}{x}} \end{equation*}\]
The one-parameter exponential family form for binomial responses shows that the canonical link is \(\log\left(\frac{p}{1-p}\right)\). Thus, GLM theory suggests that constructing a model using the logit, the log odds of a score, as a linear function of covariates is a reasonable approach.
\[ p_i = f(\mathbf{x}_i ; \beta) = \frac{exp(\mathbf{x_i'\beta})}{1 + exp(\mathbf{x_i'\beta})} \]
Equivalently,
\[ logit(p_i) = log(\frac{p_i}{1-p_i}) = \mathbf{x_i'\beta} \]
where \(\frac{p_i}{1-p_i}\) is the odds.
In this form, the model specifies that a function of the mean response is linear in the covariates; hence the name, Generalized Linear Models.
The likelihood function
\[L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i}\]
where \[p_i = \frac{\mathbf{exp(x'_i \beta})}{1+\mathbf{exp(x'_i \beta)}}\] and \(1-p_i = \frac{1}{1+ exp(\mathbf{x'_i \beta})}\)
Hence, our objective function is
\[ \begin{aligned} Q(\beta) &= \log(L(\beta))\\ &= \sum_{i=1}^n \left[Y_i \log(p_i) + (1-Y_i)\log(1-p_i)\right]\\ &= \sum_{i=1}^n \left[Y_i \log\left(\frac{\exp(\mathbf{x'_i \beta})}{1+\exp(\mathbf{x'_i \beta})}\right) + (1-Y_i)\log\left(\frac{1}{1+ \exp(\mathbf{x'_i \beta})}\right)\right]\\ &=\sum_{i=1}^n Y_i\mathbf{x'_i\beta} - \sum_{i=1}^n Y_i\log(1+\exp(\mathbf{x'_i\beta}))- \sum_{i=1}^n (1-Y_i)\log(1+\exp(\mathbf{x'_i\beta}))\\ &= \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n \log(1+ \exp(\mathbf{x'_i \beta})) \end{aligned} \]
We can maximize this function numerically, which yields the numerical MLE \(\hat{\beta}\). We can then use the standard asymptotic properties of MLEs to make inference.
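For instance, here is a minimal R sketch (simulated data; all names are illustrative) that maximizes \(Q(\beta)\) with `optim()` and checks the result against `glm()`:

```r
# Sketch: numerical MLE for logistic regression via the log-likelihood above
set.seed(42)
n <- 500
X <- cbind(1, rnorm(n))                 # design matrix with an intercept column
beta_true <- c(-0.5, 1.0)
y <- rbinom(n, 1, plogis(X %*% beta_true))

negQ <- function(beta) {
  eta <- X %*% beta
  -(sum(y * eta) - sum(log(1 + exp(eta))))   # minus Q(beta), to minimize
}
opt <- optim(c(0, 0), negQ, method = "BFGS")
opt$par                                  # numerical MLE of beta
coef(glm(y ~ X[, 2], family = binomial)) # should agree closely
```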
A key property of MLEs is that they are asymptotically unbiased, with asymptotic variance-covariance matrix given by the inverse Fisher information matrix
\[ \hat{\beta} \dot{\sim} AN(\beta,[\mathbf{I}(\beta)]^{-1}) \]
where the Fisher Information matrix, \(\mathbf{I}(\beta)\) is
\[ \begin{aligned} \mathbf{I}(\beta) &= E[\frac{\partial \log(L(\beta))}{\partial \beta}\frac{\partial \log(L(\beta))}{\partial \beta'}] \\ &= E[(\frac{\partial \log(L(\beta))}{\partial \beta_i} \frac{\partial \log(L(\beta))}{\partial \beta_j})_{ij}] \end{aligned} \]
Under regularity conditions, this is equivalent to the negative of the expected value of the Hessian Matrix
\[ \begin{aligned} \mathbf{I}(\beta) &= -E[\frac{\partial^2 \log(L(\beta))}{\partial \beta \partial \beta'}] \\ &= -E[(\frac{\partial^2 \log(L(\beta))}{\partial \beta_i \partial \beta_j})_{ij}] \end{aligned} \]
Example:
\[ x_i' \beta = \beta_0 + \beta_1 x_i \]
\[ \begin{aligned} - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_0} &= \sum_{i=1}^n \frac{\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_1} &= \sum_{i=1}^n \frac{x_i^2\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - [\frac{x_i\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_i^2p_i (1-p_i) \\ - \frac{\partial^2 \ln(L(\beta))}{\partial \beta_0 \partial \beta_1} &= \sum_{i=1}^n \frac{x_i\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - x_i[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}]^2 = \sum_{i=1}^n x_ip_i (1-p_i) \\ \end{aligned} \]
Hence,
\[ \mathbf{I} (\beta) = \left[ \begin{array} {cc} \sum_i p_i(1-p_i) & \sum_i x_i p_i(1-p_i) \\ \sum_i x_i p_i(1-p_i) & \sum_i x_i^2 p_i(1-p_i) \end{array} \right] \]
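Continuing the simulated sketch above (names `X`, `y`, `opt` carried over from that block), this matrix can be assembled directly and inverted to get the asymptotic variance-covariance matrix:

```r
# Sketch: information matrix for the simple logistic model,
# using X and opt from the previous block
p_hat <- as.vector(plogis(X %*% opt$par))
W <- p_hat * (1 - p_hat)         # the p_i(1 - p_i) weights
I_beta <- t(X) %*% (W * X)       # matches the 2x2 matrix displayed above
solve(I_beta)                    # asymptotic var-cov matrix of beta-hat
```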
Inference
Likelihood Ratio Tests
To formulate the test, let \(\beta = [\beta_1', \beta_2']'\). If you are interested in testing a hypothesis about \(\beta_1\), then we leave \(\beta_2\) unspecified (these are called nuisance parameters). \(\beta_1\) and \(\beta_2\) can each be a vector or a scalar, and \(\beta_2\) can be null.
Example: \(H_0: \beta_1 = \beta_{1,0}\) (where \(\beta_{1,0}\) is specified) and \(\hat{\beta}_{2,0}\) be the MLE of \(\beta_2\) under the restriction that \(\beta_1 = \beta_{1,0}\). The likelihood ratio test statistic is
\[ -2\log\Lambda = -2[\log(L(\beta_{1,0},\hat{\beta}_{2,0})) - \log(L(\hat{\beta}_1,\hat{\beta}_2))] \]
where
- the first term is the log-likelihood of the fitted restricted model
- the second term is the log-likelihood of the fitted unrestricted model
Under the null,
\[ -2 \log \Lambda \sim \chi^2_{\upsilon} \]
where \(\upsilon\) is the dimension of \(\beta_1\)
We reject the null when \(-2\log \Lambda > \chi_{\upsilon,1-\alpha}^2\)
Wald Statistics
Based on
\[ \hat{\beta} \sim AN (\beta, [\mathbf{I}(\beta)]^{-1}) \]
\[ H_0: \mathbf{L}\beta = 0 \]
where \(\mathbf{L}\) is a q x p matrix with q linearly independent rows. Then
\[ W = (\mathbf{L\hat{\beta}})'(\mathbf{L[I(\hat{\beta})]^{-1}L'})^{-1}(\mathbf{L\hat{\beta}}) \]
under the null hypothesis, \(W \sim \chi^2_q\)
Confidence interval
\[ \hat{\beta}_i \pm 1.96\, \hat{s}_{ii} \]
where \(\hat{s}_{ii}^2\) is the i-th diagonal element of \(\mathbf{[I(\hat{\beta})]}^{-1}\)
If you have
- large sample size, the likelihood ratio and Wald tests have similar results.
- small sample size, the likelihood ratio test is better.
Logistic Regression: Interpretation of \(\beta\)
For single regressor, the model is
\[ logit\{\hat{p}_{x_i}\} \equiv logit (\hat{p}_i) = \log(\frac{\hat{p}_i}{1 - \hat{p}_i}) = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
When \(x = x_i + 1\),
\[ logit\{\hat{p}_{x_i +1}\} = \hat{\beta}_0 + \hat{\beta}_1(x_i + 1) = logit\{\hat{p}_{x_i}\} + \hat{\beta}_1 \]
Then,
\[ logit\{\hat{p}_{x_i +1}\} - logit\{\hat{p}_{x_i}\} = log\{odds[\hat{p}_{x_i +1}]\} - log\{odds[\hat{p}_{x_i}]\} \\ = log(\frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]}) = \hat{\beta}_1 \]
and
\[ exp(\hat{\beta}_1) = \frac{odds[\hat{p}_{x_i + 1}]}{odds[\hat{p}_{x_i}]} \]
the estimated odds ratio.
More generally, the estimated odds ratio for a difference of \(c\) units in the regressor \(x\) is \(exp(c\hat{\beta}_1)\). When there are multiple covariates, \(exp(\hat{\beta}_k)\) is the estimated odds ratio for the variable \(x_k\), holding all other variables constant.
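In R, the estimated odds ratios come from exponentiating the fitted coefficients. A small sketch, assuming a fitted binomial model object `fit` with a single regressor `x` (both names illustrative):

```r
# Sketch: odds ratios from a fitted logistic regression `fit`
exp(coef(fit))               # odds ratio for a 1-unit change in each regressor
delta <- 5
exp(delta * coef(fit)["x"])  # odds ratio for a 5-unit difference in x
```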
Residuals for Binomial Regression
With LLSR, residuals were used to assess model assumptions and identify outliers. For binomial regression, two different types of residuals are typically used. One residual, the Pearson residual, has a form similar to that used with LLSR. Specifically, the Pearson residual is calculated using:
\[\begin{equation*} \textrm{Pearson residual}_i = \frac{\textrm{actual count}-\textrm{predicted count}}{\textrm{SD of count}} = \frac{Y_i-m_i\hat{p_i}}{\sqrt{m_i\hat{p_i}(1-\hat{p_i})}} \end{equation*}\] where \(m_i\) is the number of trials for the \(i^{th}\) observation and \(\hat{p}_i\) is the estimated probability of success for that same observation.
A deviance residual is an alternative residual for binomial regression based on the discrepancy between the observed values and those estimated using the likelihood. A deviance residual can be calculated for each observation using:
\[\begin{equation*} \textrm{d}_i = \textrm{sign}(Y_i-m_i\hat{p_i})\sqrt{2[Y_i \log\left(\frac{Y_i}{m_i \hat{p_i}}\right)+ (m_i - Y_i) \log\left(\frac{m_i - Y_i}{m_i - m_i \hat{p_i}}\right)]} \end{equation*}\]
When the number of trials is large for all of the observations and the models are appropriate, both sets of residuals should follow a standard normal distribution.
The sum of the squared deviance residuals is referred to as the deviance or residual deviance. The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, \(m_i\), are large and a model fits, the residual deviance follows a \(\chi^2\) distribution with \(n-p\) degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meet these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a \(\chi_{n-p}^2\) distribution that are greater than the observed residual deviance.
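A brief sketch of how these quantities are extracted in R, assuming a fitted binomial-count model object `fit` (the name is illustrative):

```r
# Sketch: residuals and a deviance goodness-of-fit p-value for a binomial glm `fit`
head(residuals(fit, type = "pearson"))   # Pearson residuals
head(residuals(fit, type = "deviance"))  # deviance residuals
# lack-of-fit p-value: P(chi^2_{n-p} > residual deviance)
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```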
Application in R and STATA
Single variable descriptive statistics
quietly import delimited heart
label define sex 1 "male" 0 "female", replace
label value sex sex
label define fbs 1 "true" 0 "false", replace
label value fbs fbs
label define target 0 "No disease" 1 "Disease", replace
label value target target
label define thal 1 "normal" 2 "fixed defect" 3 "reversible defect", replace
label value thal thal
label define exang 1 "yes" 0 "no", replace
label value exang exang
quietly save heart , replace
quietly import delimited heart
dtable, continuous(age trestbps chol thalach oldpeak) factor(exang sex target thal fbs restecg cp ca slope) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------
Summary
-------------------------
N 1,025
age 54.434 (9.072)
trestbps 131.612 (17.517)
chol 246.000 (51.593)
thalach 149.114 (23.006)
oldpeak 1.072 (1.175)
exang
0 680 (66.3%)
1 345 (33.7%)
sex
0 312 (30.4%)
1 713 (69.6%)
target
0 499 (48.7%)
1 526 (51.3%)
thal
0 7 (0.7%)
1 64 (6.2%)
2 544 (53.1%)
3 410 (40.0%)
fbs
0 872 (85.1%)
1 153 (14.9%)
restecg
0 497 (48.5%)
1 513 (50.0%)
2 15 (1.5%)
cp
0 497 (48.5%)
1 167 (16.3%)
2 284 (27.7%)
3 77 (7.5%)
ca
0 578 (56.4%)
1 226 (22.0%)
2 134 (13.1%)
3 69 (6.7%)
4 18 (1.8%)
slope
0 74 (7.2%)
1 482 (47.0%)
2 469 (45.8%)
-------------------------
(collection DTable exported to file Statin_table.docx)
- The table above shows frequencies and proportions for the categorical variables.
- Continuous variables were summarised by means and standard deviations.
use heart
quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
quietly histogram trestbps ,title("Histogram of trestbps") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)
quietly histogram chol ,title("Histogram of chol") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)
quietly histogram thalach ,title("Histogram of thalach") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph4 , replace)
quietly histogram oldpeak ,title("Histogram of oldpeak") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph5 , replace)
graph pie, over(sex) name(graph6 , replace) title("distribution of sex")
graph pie, over(exang) name(graph7 , replace) title("distribution of exang")
graph hbar (count), over(cp) blabel(bar) name(graph8 , replace) title("distribution of cp")
graph hbar (count), over(ca) blabel(bar) name(graph9 , replace) title("distribution of ca")
graph hbar (count), over(fbs) blabel(bar) name(graph10 , replace) title("distribution of fbs")
graph hbar (count), over(thal) blabel(bar) name(graph11 , replace) title("distribution of thal")
graph hbar (count), over(restecg) blabel(bar) name(graph12 , replace) title("distribution of restecg")
graph hbar (count), over(slope) blabel(bar) name(graph13 , replace) title("distribution of slope")
graph pie, over(target) name(graph14 , replace) title("distribution of target")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 graph14 ,title("")
quietly graph export all_graphs.svg, replace
Summary statistics by Outcome Variable
quietly import delimited heart
dtable, by(target, tests testnotes) continuous(age trestbps chol thalach oldpeak) factor(exang sex thal fbs restecg cp ca slope) title(Table1) titlestyles( font(, bold) ) export("target", as(docx) replace)
note: using test regress across levels of target for age, trestbps, chol,
thalach, and oldpeak.
note: using test pearson across levels of target for exang, sex, thal, fbs,
restecg, cp, ca, and slope.
Table1
------------------------------------------------------------------
target
0 1 Total Test
------------------------------------------------------------------
N 499 (48.7%) 526 (51.3%) 1,025 (100.0%)
age 56.569 (7.908) 52.409 (9.632) 54.434 (9.072) <0.001
trestbps 134.106 (18.577) 129.245 (16.112) 131.612 (17.517) <0.001
chol 251.293 (49.559) 240.979 (53.010) 246.000 (51.593) 0.001
thalach 139.130 (22.565) 158.586 (19.097) 149.114 (23.006) <0.001
oldpeak 1.600 (1.291) 0.570 (0.771) 1.072 (1.175) <0.001
exang
0 225 (45.1%) 455 (86.5%) 680 (66.3%) <0.001
1 274 (54.9%) 71 (13.5%) 345 (33.7%)
sex
0 86 (17.2%) 226 (43.0%) 312 (30.4%) <0.001
1 413 (82.8%) 300 (57.0%) 713 (69.6%)
thal
0 4 (0.8%) 3 (0.6%) 7 (0.7%) <0.001
1 43 (8.6%) 21 (4.0%) 64 (6.2%)
2 132 (26.5%) 412 (78.3%) 544 (53.1%)
3 320 (64.1%) 90 (17.1%) 410 (40.0%)
fbs
0 417 (83.6%) 455 (86.5%) 872 (85.1%) 0.188
1 82 (16.4%) 71 (13.5%) 153 (14.9%)
restecg
0 283 (56.7%) 214 (40.7%) 497 (48.5%) <0.001
1 204 (40.9%) 309 (58.7%) 513 (50.0%)
2 12 (2.4%) 3 (0.6%) 15 (1.5%)
cp
0 375 (75.2%) 122 (23.2%) 497 (48.5%) <0.001
1 33 (6.6%) 134 (25.5%) 167 (16.3%)
2 65 (13.0%) 219 (41.6%) 284 (27.7%)
3 26 (5.2%) 51 (9.7%) 77 (7.5%)
ca
0 163 (32.7%) 415 (78.9%) 578 (56.4%) <0.001
1 160 (32.1%) 66 (12.5%) 226 (22.0%)
2 113 (22.6%) 21 (4.0%) 134 (13.1%)
3 60 (12.0%) 9 (1.7%) 69 (6.7%)
4 3 (0.6%) 15 (2.9%) 18 (1.8%)
slope
0 46 (9.2%) 28 (5.3%) 74 (7.2%) <0.001
1 324 (64.9%) 158 (30.0%) 482 (47.0%)
2 129 (25.9%) 340 (64.6%) 469 (45.8%)
------------------------------------------------------------------
(collection DTable exported to file target.docx)
- Preliminary results from Chi-square tests of association and Student's t-tests indicate that, at the \(5\%\) level of significance, most variables are associated with being diagnosed or not being diagnosed with the disease.
use heart
graph box age , over(target) title("distribution of Age by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)
graph box trestbps , over(target) title("distribution of trestbps by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph2 , replace)
graph box chol ,over(target) title("distribution of chol by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)
graph box thalach ,over(target) title("distribution of thalach by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph4 , replace)
graph box oldpeak ,over(target) title("distribution of oldpeak by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph5 , replace)
graph hbar ,over(target) over(sex) stack asyvars name(graph6 , replace) title("distribution of sex")
graph hbar ,over(target) over(exang) stack asyvars name(graph7 , replace) title("distribution of exang")
graph hbar ,over(target) over(cp) stack asyvars name(graph8 , replace) title("distribution of cp")
graph hbar ,over(target) over(ca) stack asyvars name(graph9 , replace) title("distribution of ca")
graph hbar,over(target) over(fbs) stack asyvars name(graph10 , replace) title("distribution of fbs")
graph hbar ,over(target) over(thal) stack asyvars name(graph11 , replace) title("distribution of thal")
graph hbar ,over(target) over(restecg) stack asyvars name(graph12 , replace) title("distribution of restecg")
graph hbar ,over(target) over(slope) stack asyvars name(graph13 , replace) title("distribution of slope")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 ,title("")
quietly graph export all_target.svg, replace
Fit a simple logistic regression model using age as the only risk factor of heart disease status
A logistic regression model is a generalized linear model and is made up of a link function and a linear component:
\[\underbrace{g(\mu_i)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}\]
quietly import delimited heart
logit target age
Iteration 0: Log likelihood = -710.12021
Iteration 1: Log likelihood = -682.58876
Iteration 2: Log likelihood = -682.53284
Iteration 3: Log likelihood = -682.53284
Logistic regression Number of obs = 1,025
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
Log likelihood = -682.53284 Pseudo R2 = 0.0388
------------------------------------------------------------------------------
target | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0530036 .0073802 -7.18 0.000 -.0674686 -.0385386
_cons | 2.943003 .4084775 7.20 0.000 2.142402 3.743605
------------------------------------------------------------------------------
mod<-glm(target~ age,data=new,family=binomial)
summary(mod)
Call:
glm(formula = target ~ age, family = binomial, data = new)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.94300 0.40848 7.205 0.000000000000581 ***
age -0.05300 0.00738 -7.182 0.000000000000688 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1420.2 on 1024 degrees of freedom
Residual deviance: 1365.1 on 1023 degrees of freedom
AIC: 1369.1
Number of Fisher Scoring iterations: 4
- the equation below shows a simple logistic regression model fit on age as the predictor.
\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})= 2.943003-.0530036Age \end{aligned}\]
(b) Report the estimated coefficient for age, and interpret its sign and meaning in the context of heart disease risk
- the model for the output above can be written as follows
\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 2.943003-.0530036Age \end{aligned}\]
- The estimated coefficient for age is \(-0.0530036\), which is negative: for a 1 year increase in age, the log odds of being diagnosed with heart disease decrease significantly by 0.0530036.
- Thus the chances of developing heart disease decrease with age.
Is Age significantly associated with heart disease?
- Age is indeed significantly associated with the probability of developing heart disease at the 5% level of significance, since the corresponding p-value (reported as 0.000) is less than 0.05.
- The coefficient's confidence interval is \([-0.0674686, -0.0385386]\) and does not contain zero, hence the variable is significant.
\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})&= 2.943003-.0530036Age\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003-.0530036Age}\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003}.e^{-.0530036Age} \end{aligned}\]
- the odds ratio associated with age is therefore \(e^{-.0530036}=0.9483766\), meaning that for a 1 year increase in age the odds of being diagnosed with heart disease decrease by \((1-0.9483766)\times 100\% \approx 5.16\%\)
Code below shows how to get the estimated odds ratios
quietly import delimited heart
logistic target age
Logistic regression Number of obs = 1,025
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
Log likelihood = -682.53284 Pseudo R2 = 0.0388
------------------------------------------------------------------------------
target | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .9483766 .0069993 -7.18 0.000 .9347571 .9621946
_cons | 18.97274 7.749939 7.20 0.000 8.51988 42.25001
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
- also note that we can find the probability of being diagnosed at a given age by the following probability formula.
\[ \begin{aligned} \hat{p}(x) &= \frac{\exp(\hat{\beta}_{0} + \hat{\beta}_{1}Age)}{1 + \exp(\hat{\beta}_{0} + \hat{\beta}_{1}Age)}\\ &= \frac{\exp(2.943003-.0530036Age)}{1 + \exp(2.943003-.0530036Age)} \end{aligned}\]
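For example, using the fitted R model `mod` from above, the predicted probability at age 50 (an illustrative value) can be obtained either with `predict()` or by hand:

```r
# Sketch: predicted probability of disease at age 50 from the fitted model `mod`
predict(mod, newdata = data.frame(age = 50), type = "response")
plogis(2.943003 - 0.0530036 * 50)  # same computation by hand
```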
likelihood ratio test
quietly import delimited heart
quietly logit target
est store model1
quietly logit target age
est store model2
lrtest model1 model2
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
mod0<-glm(target~ 1,data=new,family=binomial)
mod1<-glm(target~ age,data=new,family=binomial)
anova(mod0,mod1,test="Chisq")
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
1024 | 1420.240 | NA | NA | NA |
1023 | 1365.066 | 1 | 55.17475 | 0 |
- The likelihood ratio test above compares the null model with a model with age alone.
- the \(p\text{-value}<0.001\) indicates that the model with age is a better fit compared to a model with no covariates at all.
- thus our current model fits the data better than a model with no predictors.
Add sex variable
quietly import delimited heart
logit target age i.sex
Iteration 0: Log likelihood = -710.12021
Iteration 1: Log likelihood = -630.0552
Iteration 2: Log likelihood = -629.90565
Iteration 3: Log likelihood = -629.90562
Logistic regression Number of obs = 1,025
LR chi2(2) = 160.43
Prob > chi2 = 0.0000
Log likelihood = -629.90562 Pseudo R2 = 0.1130
------------------------------------------------------------------------------
target | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0674349 .0080345 -8.39 0.000 -.0831823 -.0516875
1.sex | -1.533812 .1585261 -9.68 0.000 -1.844518 -1.223107
_cons | 4.819817 .4868406 9.90 0.000 3.865627 5.774007
------------------------------------------------------------------------------
mod2<-glm(target~ age+sex,data=new,family=binomial)
summary(mod2)
Call:
glm(formula = target ~ age + sex, family = binomial, data = new)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819818 0.486841 9.900 <0.0000000000000002 ***
age -0.067435 0.008035 -8.393 <0.0000000000000002 ***
sex -1.533812 0.158526 -9.675 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1420.2 on 1024 degrees of freedom
Residual deviance: 1259.8 on 1022 degrees of freedom
AIC: 1265.8
Number of Fisher Scoring iterations: 4
- the model for the output above can be written as follows
\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 4.819817-.0674349Age-1.533812Male \end{aligned}\]
- The estimated coefficient for age is \(-0.0674349\), which is negative: for a 1 year increase in age, the log odds of being diagnosed with heart disease decrease significantly by 0.0674349, adjusting for the sex of the patient.
- Adjusting for age, males have log odds of being diagnosed with heart disease 1.533812 lower than females.
- Both variables are significant at the 5% level of significance since their p-values are less than 0.05.
Confidence Intervals for Model Coefficients
Since the Wald statistic follows a normal distribution with \(n\) large, we could generate a Wald-type (normal-based) confidence interval for \(\beta_2\) using:
\(\hat\beta_2 \pm 1.96\cdot\textrm{SE}(\hat\beta_2)\) and then exponentiating endpoints if we prefer a confidence interval for the odds ratio \(e^{\beta_2}\). In this case,
\[ \begin{align*} 95\% \textrm{ CI for } \beta_2 &= \hat{\beta}_2 \pm 1.96 \cdot \textrm{SE}(\hat{\beta}_2) \\ &= -1.533812 \pm 1.96 \cdot 0.158526 \\ &= -1.533812 \pm 0.310711 \\ &= (-1.844523, -1.223101) \\ 95\% \textrm{ CI for } e^{\beta_2} &= (e^{-1.844523}, e^{-1.223101}) \\ &= (0.1581007, 0.2943161) \\ \end{align*} \]
confint(mod2)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 3.88237336 5.79217853
age -0.08342613 -0.05190659
sex -1.84937968 -1.22746318
exp(confint(mod2))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 48.5392799 327.7262070
age 0.9199590 0.9494175
sex 0.1573347 0.2930350
Thus, we can be 95% confident that being male is associated with a 70.6% to 84.2% decrease in the odds of heart disease, after controlling for age.
Multinomial logistic Regression
If we have more than two categories or groups that we want to model relative to covariates (e.g., we have observations \(i = 1,\dots,n\) and groups \(j = 1,2,\dots,J\)), the multinomial model is our candidate
Let
- \(p_{ij}\) be the probability that the i-th observation belongs to the j-th group
- \(Y_{ij}\) be the number of observations for individual i in group j; An individual will have observations \(Y_{i1},Y_{i2},…Y_{iJ}\)
- assume the probability of observing this response is given by a multinomial distribution in terms of probabilities \(p_{ij}\), where \(\sum_{j = 1}^J p_{ij} = 1\) . For interpretation, we have a baseline category \(p_{i1} = 1 - \sum_{j = 2}^J p_{ij}\)
The link between the mean response (probability) \(p_{ij}\) and a linear function of the covariates
\[ \eta_{ij} = \mathbf{x'_i \beta_j} = \log \frac{p_{ij}}{p_{i1}}, \quad j = 2,\dots,J \]
We compare \(p_{ij}\) to the baseline \(p_{i1}\), suggesting
\[ p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{l=2}^J \exp(\eta_{il})} \]
which is known as multinomial logistic model.
Note:
- Softmax coding for multinomial logistic regression: rather than selecting a baseline class, we treat all \(K\) classes symmetrically, as equally important (no baseline).
\[ P(Y = k | X = x) = \frac{\exp(\beta_{k0} + \beta_{k1} x_1 + \dots + \beta_{kp} x_p)}{\sum_{l = 1}^K \exp(\beta_{l0} + \beta_{l1} x_1 + \dots + \beta_{lp} x_p)} \]
then the log odds ratio between k-th and k’-th classes is
\[ \log (\frac{P(Y=k|X=x)}{P(Y = k' | X=x)}) = (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1}) x_1 + \dots + (\beta_{kp} - \beta_{k'p}) x_p \]
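A tiny numerical sketch of the softmax calculation (made-up linear predictor values for \(K=3\) classes):

```r
# Sketch: softmax probabilities for K = 3 classes at one observation
eta <- c(0.2, -0.5, 1.1)        # made-up linear predictors, one per class
p <- exp(eta) / sum(exp(eta))   # softmax; sums to 1
p
log(p[1] / p[2])                # log odds ratio = eta[1] - eta[2] = 0.7
```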
About the dataset
You are an analyst at a large technology company. The company recently introduced a new health insurance provider for its employees. At the beginning of the year the employees had to choose one of three different health plan products from this provider to best suit their needs. You have been asked to determine which factors influenced the choice in product.
The `health_insurance` data set consists of the following fields:
- `product`: The choice of product of the individual—A, B or C
- `age`: The age of the individual when they made the choice
- `gender`: The gender of the individual as stated when they made the choice
- `household`: The number of people living with the individual in the same household at the time of the choice
- `position_level`: Position level in the company at the time they made the choice, where 1 is the lowest and 5 is the highest
- `absent`: The number of days the individual was absent from work in the year prior to the choice
Descriptive statistics for single variables
quietly import delimited health_insurance
egen product2 = group(product), label
egen position_level2 = group(position_level), label
egen gender2 = group(gender), label
drop product
drop position_level
drop gender
rename product2 product
rename gender2 gender
rename position_level2 position_level
quietly save health_insurance , replace
dtable, continuous(age absent household) factor(gender product position_level) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------------------
Summary
-------------------------------------
N 1,453
age 40.919 (13.530)
absent 14.467 (8.114)
household 3.257 (2.231)
group(gender)
Female 889 (61.2%)
Male 559 (38.5%)
Non-binary 5 (0.3%)
group(product)
A 495 (34.1%)
B 459 (31.6%)
C 499 (34.3%)
group(position_level)
1 184 (12.7%)
2 404 (27.8%)
3 436 (30.0%)
4 231 (15.9%)
5 198 (13.6%)
-------------------------------------
(collection DTable exported to file Statin_table.docx)
Descriptive plots for single variables
use health_insurance
quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
quietly histogram absent ,title("Histogram of absentism") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)
quietly histogram household ,title("Histogram of number of households") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)
graph pie, over(position_level) name(graph4 , replace) title("distribution of Position level")
graph hbar (count), over(gender) blabel(bar) name(graph5 , replace) title("distribution of Gender")
graph hbar (count), over(product) blabel(bar) name(graph6 , replace) title("distribution of Products")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 ,title("")
quietly graph export all_multi.svg, replace
Descriptive statistics by outcome variable
use health_insurance
dtable, by(product, tests testnotes) continuous(age absent household) factor(gender position_level) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
note: using test regress across levels of product for age, absent, and
household.
note: using test pearson across levels of product for gender and
position_level.
Table1
-------------------------------------------------------------------------------------------
group(product)
A B C Total Test
-------------------------------------------------------------------------------------------
N 495 (34.1%) 459 (31.6%) 499 (34.3%) 1,453 (100.0%)
age 28.913 (5.158) 44.865 (13.756) 49.198 (10.346) 40.919 (13.530) <0.001
absent 14.451 (8.030) 14.394 (8.092) 14.549 (8.232) 14.467 (8.114) 0.956
household 3.642 (2.339) 1.462 (1.201) 4.527 (1.740) 3.257 (2.231) <0.001
group(gender)
Female 244 (49.3%) 404 (88.0%) 241 (48.3%) 889 (61.2%) <0.001
Male 249 (50.3%) 53 (11.5%) 257 (51.5%) 559 (38.5%)
Non-binary 2 (0.4%) 2 (0.4%) 1 (0.2%) 5 (0.3%)
group(position_level)
1 66 (13.3%) 59 (12.9%) 59 (11.8%) 184 (12.7%) 0.848
2 136 (27.5%) 133 (29.0%) 135 (27.1%) 404 (27.8%)
3 144 (29.1%) 146 (31.8%) 146 (29.3%) 436 (30.0%)
4 81 (16.4%) 65 (14.2%) 85 (17.0%) 231 (15.9%)
5 68 (13.7%) 56 (12.2%) 74 (14.8%) 198 (13.6%)
-------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)
Plots by Outcome variable
use health_insurance
graph box age ,over(product) title("boxplot of Age by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)
graph box absent ,over(product) title("boxplot of absentism by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph2 , replace)
graph box household ,over(product) title("boxplot of households by product ") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)
graph hbar,over(product) over(position_level) stack asyvars name(graph4 , replace) title("distribution of Position level")
graph hbar ,over(product) over(gender) stack asyvars name(graph5 , replace) title("distribution of Gender")
graph combine graph1 graph2 graph3 graph4 graph5 ,title("")
quietly graph export all_multii.svg, replace
Assess the nature of the product variable.
use health_insurance
tab product
group(produ |
ct) | Freq. Percent Cum.
------------+-----------------------------------
A | 495 34.07 34.07
B | 459 31.59 65.66
C | 499 34.34 100.00
------------+-----------------------------------
Total | 1,453 100.00
- The product variable has 3 levels which represent 3 product types (\(A, B ~and~ C\))
- Thus the variable is of nominal type because it represents distinct product types without an inherent order (i.e., A is not objectively less than B, and B is not less than C).
Model building
The most suitable model for this analysis is multinomial logistic regression.
Log Odds and Odds Ratios
- The logit functions (log odds) when our outcome is the Product variable with categories (A, B and C) and the independent variable is age are shown below
- the log odds for comparison between B and A is
\[g_1(x) = ln\frac{P(product=B|age)} {P(product=A|age)}=\alpha_1 + (\beta_{11}age)\]
- and, the log odds for comparison between C and A is
\[g_2(x) = ln\frac{P(Product=C|age)} {P(Product=A|age)}=\alpha_2 + (\beta_{21}age)\]
- Here the base category is product A, hence all inferences are done with reference to it
use health_insurance
mlogit product age , baseoutcome(1)
mlogit product age , baseoutcome(1) rrr
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -1230.0067
Iteration 2: Log likelihood = -1175.2485
Iteration 3: Log likelihood = -1171.0613
Iteration 4: Log likelihood = -1171.0379
Iteration 5: Log likelihood = -1171.0379
Multinomial logistic regression Number of obs = 1,453
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Log likelihood = -1171.0379 Pseudo R2 = 0.2659
------------------------------------------------------------------------------
product | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | .1969927 .0122123 16.13 0.000 .173057 .2209283
_cons | -6.974368 .4166712 -16.74 0.000 -7.791029 -6.157708
-------------+----------------------------------------------------------------
C |
age | .2274424 .0125945 18.06 0.000 .2027576 .2521272
_cons | -8.323104 .4420157 -18.83 0.000 -9.189439 -7.456769
------------------------------------------------------------------------------
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -1230.0067
Iteration 2: Log likelihood = -1175.2485
Iteration 3: Log likelihood = -1171.0613
Iteration 4: Log likelihood = -1171.0379
Iteration 5: Log likelihood = -1171.0379
Multinomial logistic regression Number of obs = 1,453
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Log likelihood = -1171.0379 Pseudo R2 = 0.2659
------------------------------------------------------------------------------
product | RRR Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | 1.217735 .0148713 16.13 0.000 1.188934 1.247234
_cons | .0009356 .0003898 -16.74 0.000 .0004134 .0021171
-------------+----------------------------------------------------------------
C |
age | 1.255385 .015811 18.06 0.000 1.224776 1.28676
_cons | .0002428 .0001073 -18.83 0.000 .0001021 .0005775
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
model<-nnet::multinom(product~age, data=my_data)
# weights: 9 (4 variable)
initial value 1596.283655
iter 10 value 1171.049631
final value 1171.037911
converged
tab_model(model)
group(product)

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 | B |
| age | 1.22 | 1.19 – 1.25 | <0.001 | B |
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 | C |
| age | 1.26 | 1.22 – 1.29 | <0.001 | C |
| Observations | 1453 | | | |
| R2 / R2 adjusted | 0.266 / 0.265 | | | |
Thus, for the simple model, the Stata output gives us two logit models:
\[ \begin{aligned} g_1(x) &= ln\frac{P(Product=B|age)} {P(Product=A|age)}=-6.974368 + (0.1969927age)\\ RRR_1&=\frac{P(Product=B|age)} {P(Product=A|age)}=e^{-6.974368 + (0.1969927age)} \end{aligned}\]
Interpretation of coefficients
- For a one-unit change in the variable age, the multinomial log-odds for choosing Product B relative to Product A increase by 0.1969927; age is statistically significant at the 5% level of significance.
Interpretation of exponentiated coefficients
- For a one-unit change in the variable age, the relative risk ratio for choosing Product B relative to Product A increases by a factor of \(e^{0.1969927}=1.21773\); age is statistically significant at the 5% level of significance.
\[ \begin{aligned} g_2(x) &= ln\frac{P(Product=C|age)} {P(Product=A|age)}=-8.323104 + (0.2274424age)\\ RRR_2&= \frac{P(Product=C|age)} {P(Product=A|age)}=e^{-8.323104 + (0.2274424age)} \end{aligned} \]
- For a one-unit change in the variable age, the multinomial log-odds for choosing Product C relative to Product A increase by 0.2274424; age is statistically significant at the 5% level of significance.
Interpretation of exponentiated coefficients
- For a one-unit change in the variable age, the relative risk ratio for choosing Product C relative to Product A increases by a factor of \(e^{0.2274424}=1.255385\); age is statistically significant at the 5% level of significance.
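In R, the same log-RRRs and RRRs can be read off the fitted `multinom` object `model` from above (a sketch):

```r
# Sketch: log-RRRs and RRRs from the fitted multinom object `model`
coef(model)       # one row per non-baseline outcome (B and C vs A)
exp(coef(model))  # RRRs, e.g. exp(0.1969927) = 1.2177 for age (B vs A)
```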
model2<-nnet::multinom(product~age+absent+household+gender+position_level, data=my_data)
# weights: 33 (20 variable)
initial value 1596.283655
iter 10 value 896.964516
iter 20 value 754.444237
iter 30 value 743.380590
final value 743.380575
converged
tab_model(model2)
group(product)

| Predictors | Odds Ratios | CI | p | Response |
|---|---|---|---|---|
| (Intercept) | 0.01 | 0.00 – 0.02 | <0.001 | B |
| age | 1.28 | 1.24 – 1.32 | <0.001 | B |
| absent | 1.01 | 0.99 – 1.04 | 0.297 | B |
| household | 0.38 | 0.33 – 0.44 | <0.001 | B |
| group(gender): Male | 0.09 | 0.06 – 0.15 | <0.001 | B |
| group(gender): Non-binary | 1.27 | 0.12 – 13.77 | 0.843 | B |
| group(position level): position level 2 | 0.82 | 0.44 – 1.50 | 0.515 | B |
| group(position level): position level 3 | 0.51 | 0.27 – 0.93 | 0.028 | B |
| group(position level): position level 4 | 0.22 | 0.10 – 0.49 | <0.001 | B |
| group(position level): position level 5 | 0.24 | 0.11 – 0.54 | <0.001 | B |
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 | C |
| age | 1.31 | 1.27 – 1.35 | <0.001 | C |
| absent | 1.00 | 0.98 – 1.03 | 0.717 | C |
| household | 1.23 | 1.12 – 1.36 | <0.001 | C |
| group(gender): Male | 1.11 | 0.76 – 1.64 | 0.581 | C |
| group(gender): Non-binary | 0.27 | 0.01 – 14.42 | 0.521 | C |
| group(position level): position level 2 | 0.87 | 0.48 – 1.58 | 0.640 | C |
| group(position level): position level 3 | 0.73 | 0.40 – 1.33 | 0.305 | C |
| group(position level): position level 4 | 0.44 | 0.22 – 0.88 | 0.021 | C |
| group(position level): position level 5 | 0.48 | 0.23 – 1.01 | 0.055 | C |
| Observations | 1453 | | | |
| R2 / R2 adjusted | 0.534 / 0.533 | | | |
use health_insurance
ssc install fitstat
mlogit product age absent household i.gender i.position_level , baseoutcome(1)
mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
checking fitstat consistency and verifying not already installed...
all files already exist and are up to date.
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -774.56045
Iteration 2: Log likelihood = -745.09229
Iteration 3: Log likelihood = -743.38342
Iteration 4: Log likelihood = -743.38057
Iteration 5: Log likelihood = -743.38057
Multinomial logistic regression Number of obs = 1,453
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
Log likelihood = -743.38057 Pseudo R2 = 0.5340
------------------------------------------------------------------------------
product | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | .2449861 .0154906 15.82 0.000 .2146251 .2753472
absent | .0137522 .0131766 1.04 0.297 -.0120734 .0395778
household | -.9662861 .0698785 -13.83 0.000 -1.103245 -.8293269
|
gender |
Male | -2.369476 .2334001 -10.15 0.000 -2.826932 -1.91202
Non-binary | .240953 1.214742 0.20 0.843 -2.139897 2.621803
|
position_l~l |
2 | -.2033913 .3119391 -0.65 0.514 -.8147807 .4079982
3 | -.6830315 .3107755 -2.20 0.028 -1.29214 -.0739228
4 | -1.510145 .4024609 -3.75 0.000 -2.298954 -.7213359
5 | -1.407559 .4024606 -3.50 0.000 -2.196368 -.6187512
|
_cons | -5.207851 .5493343 -9.48 0.000 -6.284527 -4.131176
-------------+----------------------------------------------------------------
C |
age | .2706643 .0157149 17.22 0.000 .2398636 .3014651
absent | .0046137 .0127058 0.36 0.717 -.0202892 .0295166
household | .2077009 .0497547 4.17 0.000 .1101834 .3052184
|
gender |
Male | .1085684 .1964055 0.55 0.580 -.2763794 .4935162
Non-binary | -1.293302 2.021219 -0.64 0.522 -5.254818 2.668215
|
position_l~l |
2 | -.1427916 .3048914 -0.47 0.640 -.7403678 .4547845
3 | -.3147303 .306651 -1.03 0.305 -.9157553 .2862946
4 | -.8247833 .3575154 -2.31 0.021 -1.525501 -.124066
5 | -.730075 .3794344 -1.92 0.054 -1.473753 .0136028
|
_cons | -10.54573 .6452422 -16.34 0.000 -11.81038 -9.281076
------------------------------------------------------------------------------
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -774.56045
Iteration 2: Log likelihood = -745.09229
Iteration 3: Log likelihood = -743.38342
Iteration 4: Log likelihood = -743.38057
Iteration 5: Log likelihood = -743.38057
Multinomial logistic regression Number of obs = 1,453
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
Log likelihood = -743.38057 Pseudo R2 = 0.5340
------------------------------------------------------------------------------
product | RRR Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | 1.277604 .0197908 15.82 0.000 1.239397 1.316988
absent | 1.013847 .013359 1.04 0.297 .9879992 1.040371
household | .3804935 .0265883 -13.83 0.000 .3317925 .4363429
|
gender |
Male | .0935297 .0218298 -10.15 0.000 .0591942 .1477815
Non-binary | 1.272461 1.545712 0.20 0.843 .1176669 13.76052
|
position_l~l |
2 | .8159589 .2545295 -0.65 0.514 .4427364 1.503804
3 | .5050835 .1569676 -2.20 0.028 .2746823 .9287434
4 | .220878 .0888948 -3.75 0.000 .1003638 .4861024
5 | .2447399 .0984981 -3.50 0.000 .1112064 .5386167
|
_cons | .0054734 .0030067 -9.48 0.000 .0018649 .016064
-------------+----------------------------------------------------------------
C |
age | 1.310835 .0205997 17.22 0.000 1.271076 1.351838
absent | 1.004624 .0127646 0.36 0.717 .9799153 1.029957
household | 1.230845 .0612404 4.17 0.000 1.116483 1.356921
|
gender |
Male | 1.114681 .2189296 0.55 0.580 .7585251 1.638066
Non-binary | .2743634 .5545485 -0.64 0.522 .0052223 14.41421
|
position_l~l |
2 | .8669347 .2643209 -0.47 0.640 .4769385 1.575834
3 | .7299857 .2238508 -1.03 0.305 .4002142 1.331485
4 | .43833 .1567097 -2.31 0.021 .2175122 .8833216
5 | .4818729 .1828392 -1.92 0.054 .2290642 1.013696
|
_cons | .0000263 .000017 -16.34 0.000 7.43e-06 .0000932
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
Product B relative to Product A
\[ \begin{aligned} \ln\frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} &= -5.207851 + 0.2449861\,Age + 0.0137522\,Absent - 0.9662861\,Household \\ &\quad - 2.369476\,Male + 0.240953\,Nonbinary - 0.2033913\,Position_2 \\ &\quad - 0.6830315\,Position_3 - 1.510145\,Position_4 - 1.407559\,Position_5 \\ \frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} &= e^{-5.207851 + 0.2449861 Age + 0.0137522 Absent - 0.9662861 Household - 2.369476 Male + 0.240953 Nonbinary - 0.2033913 Position_2 - 0.6830315 Position_3 - 1.510145 Position_4 - 1.407559 Position_5} \end{aligned}\]
- For a one-unit change in the variable age, the relative risk ratio for choosing Product B relative to Product A increases significantly by a factor of \(1.277604\), adjusting for the other covariates; age is statistically significant \((p<0.05)\).
- For a 1-day increase in absenteeism from the previous year, the relative risk ratio for choosing Product B relative to Product A increases by a factor of \(1.013847\), adjusting for the other variables; however, absenteeism is not statistically significant \((p>0.05)\).
- If one person is added to the number of individuals in the employee's household, the relative risk ratio of choosing Product B relative to Product A is multiplied by \(0.3804935\) (a decrease), adjusting for the other covariates in the model; household size is statistically significant at the 5% level of significance.
- The relative risk ratio for choosing Product B relative to Product A decreases by \((1-0.0935297)\approx 91\%\) for males as compared to females, adjusting for all other covariates; gender is a statistically significant predictor of the choice of insurance product \((p<0.05)\).
- The relative risk ratio for choosing Product B relative to Product A decreases at position levels 2, 3, 4 and 5 relative to position 1, holding all other covariates constant. In essence this means:
- decrease in RRR for Position 2 vs Position 1 -> \(0.1840411\)
- decrease in RRR for Position 3 vs Position 1 -> \(0.4949165^{***}\)
- decrease in RRR for Position 4 vs Position 1 -> \(0.779122^{***}\)
- decrease in RRR for Position 5 vs Position 1 -> \(0.7552601^{***}\)
here we note that the starred comparisons are the only ones significant at the 5% level of significance.
Product C relative to Product A
\[ \begin{aligned} \ln\frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} &= -10.54573 + 0.2706643\,Age + 0.0046137\,Absent + 0.2077009\,Household \\ &\quad + 0.1085684\,Male - 1.293302\,Nonbinary - 0.1427916\,Position_2 \\ &\quad - 0.3147303\,Position_3 - 0.8247833\,Position_4 - 0.730075\,Position_5 \end{aligned}\]
which equates to the RRR as follows:
\[ \frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)} = e^{-10.54573 + 0.2706643 Age + 0.0046137 Absent + 0.2077009 Household + 0.1085684 Male - 1.293302 Nonbinary - 0.1427916 Position_2 - 0.3147303 Position_3 - 0.8247833 Position_4 - 0.730075 Position_5} \]
- For a one-unit change in the variable age, the relative risk ratio for choosing Product C relative to Product A increases significantly by a factor of 1.310835, adjusting for the other covariates; age is statistically significant.
- For a 1-day increase in absenteeism from the previous year, the relative risk ratio for choosing Product C relative to Product A increases by a factor of 1.004624, adjusting for the other variables; however, absenteeism is not statistically significant.
- If one person is added to the number of individuals in the employee's household, the relative risk ratio of choosing Product C relative to Product A increases by a factor of 1.230845, adjusting for the other covariates in the model; household size is statistically significant at the 5% level of significance.
- The relative risk ratio for choosing Product C relative to Product A increases by about 11.5% \((1.114681-1)\) for males as compared to females, adjusting for all other covariates; however, gender is not a statistically significant predictor of the choice of insurance product here.
- The relative risk ratio for choosing Product C relative to Product A decreases at position levels 2, 3, 4 and 5 relative to position 1, holding all other covariates constant. In essence this means:
- decrease in RRR for Position 2 vs Position 1 -> \(0.1330653\)
- decrease in RRR for Position 3 vs Position 1 -> \(0.2700143\)
- decrease in RRR for Position 4 vs Position 1 -> \(0.56167^{***}\)
- decrease in RRR for Position 5 vs Position 1 -> \(0.5181271\)
here we note that the starred comparison is the only one significant at the 5% level of significance.
Evaluate and compare the overall fit
Null model vs Age model
use health_insurance
quietly mlogit product, baseoutcome(1) rrr
est store model1
quietly mlogit product age , baseoutcome(1) rrr
est store model2
**Likelihood ratio test**
estout model1 model2 , stats(bic aic)
lrtest model1 model2
model1 model2
b b
--------------------------------------
A
age 0
_cons 0 0
--------------------------------------
B
age .1969927
_cons -.0755076 -6.974368
--------------------------------------
C
age .2274424
_cons .0080483 -8.323104
--------------------------------------
bic 3205.108 2371.201
aic 3194.546 2350.076
--------------------------------------
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
- For the model with age, AIC and BIC decreased substantially, indicating that the model with age performs better than the null model.
- Through the likelihood ratio test we see that the model with age performs better than the null model (the model with no covariates) at the 5% level of significance (\(p<0.001\))
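The same comparison can be done in R with nested `multinom` fits; a sketch, where `model0` is a hypothetical null fit and `model` is the age-only fit from above:

```r
# Sketch: likelihood ratio test for nested multinom models in R
model0 <- nnet::multinom(product ~ 1, data = my_data)  # null model
anova(model0, model, test = "Chisq")                   # LR test, df = 2
```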
Full model vs null model
use health_insurance
quietly mlogit product , baseoutcome(1) rrr
est store model1
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
est store model2
**Likelihood ratio test**
estout model1 model2 , stats(bic aic)
lrtest model1 model2
model1 model2
b b
--------------------------------------
A
age 0
absent 0
household 0
1.gender 0
2.gender 0
3.gender 0
1.position~l 0
2.position~l 0
3.position~l 0
4.position~l 0
5.position~l 0
_cons 0 0
--------------------------------------
B
age .2449861
absent .0137522
household -.9662861
1.gender 0
2.gender -2.369476
3.gender .240953
1.position~l 0
2.position~l -.2033913
3.position~l -.6830315
4.position~l -1.510145
5.position~l -1.407559
_cons -.0755076 -5.207851
--------------------------------------
C
age .2706643
absent .0046137
household .2077009
1.gender 0
2.gender .1085684
3.gender -1.293302
1.position~l 0
2.position~l -.1427916
3.position~l -.3147303
4.position~l -.8247833
5.position~l -.730075
_cons .0080483 -10.54573
--------------------------------------
bic 3205.108 1632.389
aic 3194.546 1526.761
--------------------------------------
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
- For the model with all variables, AIC and BIC decreased substantially, indicating that the full model performs better than the null model.
- Through the likelihood ratio test we see that the full model performs better than the null model (the model with no covariates) at the 5% level of significance (\(p<0.001\))
Justify your selection
use health_insurance
quietly mlogit product age, baseoutcome(1) rrr
est store model1
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
est store model2
**Likelihood ratio test**
lrtest model1 model2
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(16) = 855.31
Prob > chi2 = 0.0000
- The results of the likelihood ratio test can be used to ascertain the significance of predictors in the model. It tells us that the addition of more variables to age had significant main effects on product choice,
\[\chi^2_{(16)} = 855.31, p = .000\]
use health_insurance
*Measures for model with age alone*
quietly mlogit product age, baseoutcome(1) rrr
fitstat , save
*Model with additional variables*
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
fitstat, dif
Measures of Fit for mlogit of product
Log-Lik Intercept Only: -1595.273 Log-Lik Full Model: -1171.038
D(1447): 2342.076 LR(2): 848.470
Prob > LR: 0.000
McFadden's R2: 0.266 McFadden's Adj R2: 0.262
Maximum Likelihood R2: 0.442 Cragg & Uhler's R2: 0.498
Count R2: 0.509 Adj Count R2: 0.255
AIC: 1.620 AIC*n: 2354.076
BIC: -8194.089 BIC': -833.907
(Indices saved in matrix fs_0)
Measures of Fit for mlogit of product
Log-Lik Intercept Only: -1595.273 Log-Lik Full Model: -743.381
D(1417): 1486.761 LR(18): 1703.784
Prob > LR: 0.000
McFadden's R2: 0.534 McFadden's Adj R2: 0.511
Maximum Likelihood R2: 0.690 Cragg & Uhler's R2: 0.777
Count R2: 0.566 Adj Count R2: 0.341
AIC: 1.073 AIC*n: 1558.761
BIC: -8830.962 BIC': -1572.719
- The two fitstat outputs above show results for the model with the single predictor age and for the full model, respectively.
- We see that on fitting the full model the following significant changes which show model improvement were noted:
- log likelihood increased from \(-1171.038~ to ~-743.381\)
- Akaike information criterion decreased from \(2354.076 ~to~ 1558.761\)
- Bayesian Information criterion decreased from \(-833.907 ~to~ -1572.719\)
All these results suggest that adding more variables to the model improved model fit. Hence the model with all variables performs best, better than the model with age alone.
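The corresponding information-criterion comparison in R is a one-liner on the two fitted `multinom` objects (a sketch using `model` and `model2` from above):

```r
# Sketch: AIC/BIC comparison of the two multinom fits
AIC(model, model2)
BIC(model, model2)
```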