Amoud University

Abstract

This primer provides an overview of 35 different Bayesian regression models, including linear, generalized linear, and other regression models, using reproducible R, JAGS, BUGS, INLA, and Stan code. The models are categorized based on the nature and scale of measurement of the dependent variable, including continuous variables, counts, nominal (binary and multinomial), and ordinal.

For each model, we provide a definition, assumptions, use, and real-life applications in research. We also present R code examples to illustrate the model fitting and evaluation process, including data preparation, model selection, model execution, and result interpretation.

In the continuous variable category, we cover models such as linear regression, Bayesian ridge regression, and quantile and LASSO regression, which are used to model continuous outcomes. In the count category, we cover models such as Poisson and negative binomial regression, which are used to model count data.

In the nominal category, we cover models such as the logistic and multinomial logistic regression, which are used to model binary and multinomial outcomes. In the ordinal category, we cover models such as the ordered logistic and probit regression, which are used to model ordinal outcomes.

The primer emphasizes the importance of Bayesian regression models in capturing uncertainty and providing more realistic estimates of model parameters. We also discuss the assumptions and limitations of each model and provide guidance on how to choose the appropriate model based on the research question and data type.

Overall, this primer provides a practical guide to Bayesian regression models using R, JAGS, BUGS, INLA and Stan, suitable for researchers, applied statisticians and data analysts at all levels. The primer covers a wide range of models and provides R code examples that can be easily adapted to suit individual research needs.

Introduction

Regression analysis is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. It is widely used in various fields, such as social sciences, engineering, economics, and health sciences, to analyze data and make predictions.

Bayesian regression is a statistical approach that combines prior knowledge with observed data to make inferences about model parameters. It is a powerful tool that provides more realistic estimates of uncertainty and can handle complex data structures and nonlinear relationships.

This primer provides a comprehensive overview of 35 different Bayesian regression models using reproducible R software, INLA, JAGS, BUGS, and Stan. The models are categorized based on the nature and scale of measurement of the dependent variable, including continuous variables, counts, nominal (binary and multinomial), and ordinal scales.

For each model, we provide a definition, assumptions, use, and real-life applications in research. We also present R code examples to illustrate the model fitting and evaluation process, including data preparation, model selection, model execution, and result interpretation.

The primer emphasizes the importance of Bayesian regression models in capturing uncertainty and providing more realistic estimates of model parameters. We also discuss the assumptions and limitations of each model and provide guidance on how to choose the appropriate model based on the research question and data type.

Overall, this primer is a practical guide to Bayesian regression models using R, JAGS, BUGS, INLA and Stan, suitable for researchers, applied statisticians and data analysts at all levels. It covers a wide range of models and provides R code examples that can be easily adapted to suit individual research needs.

Bayesian Methods

Bayesian regression models are based on the principles of Bayesian statistics, which involves updating prior beliefs or knowledge about a model parameter based on observed data. The Bayesian approach to regression modeling allows for the incorporation of prior knowledge or information about the model parameters, which can help to improve the accuracy of the model estimates.

Bayesian regression models are characterized by the following components:

Prior distribution:

This is the distribution of the model parameters before any data is observed. The choice of prior distribution can have a significant impact on the posterior distribution and the resulting inferences.

Likelihood function:

This is the probability of observing the data given the model parameters. The likelihood function is used to update the prior distribution to the posterior distribution.

Posterior distribution:

This is the distribution of the model parameters after observing the data. It is obtained by combining the prior distribution and the likelihood function.

Markov Chain Monte Carlo (MCMC) methods:

These are techniques used to sample from the posterior distribution. One popular MCMC algorithm is Hamiltonian Monte Carlo (HMC), which is implemented in the Stan software.

Bayesian regression models have several advantages over traditional frequentist regression models. For instance, Bayesian models provide more realistic estimates of uncertainty, can handle complex data structures and nonlinear relationships, and allow for the incorporation of prior knowledge or information about the model parameters.

In the next sections, we will cover a range of Bayesian regression models, including linear, generalized linear, and other alternative regression models, and demonstrate how to implement them using R, JAGS, BUGS, INLA and Stan.

Prior Distribution and Selection Process

The choice of prior distribution can have a significant impact on the posterior distribution and the resulting inferences. Prior distributions can be chosen based on expert knowledge, previous research, or empirical data.

There are two main types of prior distributions: informative and non-informative. Informative priors incorporate specific knowledge or beliefs about the model parameters, while non-informative priors are chosen to be vague and allow the data to have the greatest influence on the posterior distribution.

The selection of a prior distribution depends on the available knowledge and the research question. In situations where there is little prior knowledge about the model parameters, non-informative priors are often used. However, in situations where there is significant prior knowledge or there are strong theoretical reasons to believe in a particular prior distribution, informative priors may be more appropriate.

Prior selection can be a challenging and subjective process, and it is important to assess the sensitivity of the results to different prior distributions. Sensitivity analysis involves examining the posterior distribution and inferences obtained using different prior distributions to assess the robustness of the results to prior specification.
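As a minimal sketch of such a sensitivity analysis, assuming the rstanarm package is installed, informative and vague priors can be specified explicitly and the resulting posterior estimates compared; the built-in mtcars data and the prior values below are purely illustrative.

# Prior specification and a simple sensitivity analysis with rstanarm
library(rstanarm)

# Weakly informative (near non-informative) priors on the regression coefficients
fit_vague <- stan_glm(mpg ~ wt + hp, data = mtcars,
                      prior = normal(0, 10),
                      prior_intercept = normal(0, 50),
                      chains = 2, iter = 2000, seed = 123)

# Informative priors, e.g. motivated by previous studies (values are illustrative)
fit_inform <- stan_glm(mpg ~ wt + hp, data = mtcars,
                       prior = normal(location = c(-5, 0), scale = c(2, 0.05)),
                       prior_intercept = normal(30, 10),
                       chains = 2, iter = 2000, seed = 123)

# Compare posterior point estimates under the two priors
round(cbind(vague = coef(fit_vague), informative = coef(fit_inform)), 3)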

Likelihood Function

The likelihood function is a fundamental component of Bayesian regression models. It is the probability of observing the data given the model parameters and is used to update the prior distribution to the posterior distribution.

The choice of likelihood function depends on the nature and scale of measurement of the dependent variable. For continuous variables, the likelihood function is typically a normal distribution. For count data, the likelihood function may be a Poisson or negative binomial distribution. For binary data, the likelihood function may be a Bernoulli or binomial distribution. For ordinal data, the likelihood function may be an ordered logistic or probit distribution.

In some cases, the likelihood function may need to be customized to fit the specific characteristics of the data. For example, if the data exhibit heteroscedasticity (varying levels of variance across the range of the dependent variable), a variance function may be incorporated into the likelihood function.

The likelihood function is a key component of Bayesian regression models and plays a crucial role in the estimation of the posterior distribution. In the next sections, we will provide examples of how to specify likelihood functions for different Bayesian regression models and demonstrate how to fit the models using R and Stan.
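As a minimal sketch, assuming the rstanarm package is installed, the likelihood is selected through the family argument (or a dedicated fitting function for ordinal outcomes); the data frame dat and its columns y, x1, and x2 are hypothetical placeholders.

library(rstanarm)

# Continuous outcome: normal (Gaussian) likelihood
fit_normal  <- stan_glm(y ~ x1 + x2, data = dat, family = gaussian())

# Count outcome: Poisson or negative binomial likelihood
fit_poisson <- stan_glm(y ~ x1 + x2, data = dat, family = poisson())
fit_negbin  <- stan_glm(y ~ x1 + x2, data = dat, family = neg_binomial_2())

# Binary outcome: Bernoulli/binomial likelihood with a logit link
fit_logit   <- stan_glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))

# Ordinal outcome: ordered logistic likelihood (y must be an ordered factor)
fit_ordinal <- stan_polr(y ~ x1 + x2, data = dat, prior = R2(0.3))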

Posterior Distribution

The posterior distribution is the distribution of the model parameters after observing the data and updating the prior distribution using the likelihood function. It is obtained by combining the prior distribution and the likelihood function.

The posterior distribution provides information about the uncertainty of the model parameters and is used to make inferences about the parameters of interest. The posterior distribution can be summarized using various statistics, such as the mean, median, standard deviation, and credible intervals.

The posterior distribution can be estimated using Markov Chain Monte Carlo (MCMC) methods, which involve simulating a large number of parameter values from the posterior distribution. One popular MCMC algorithm is Hamiltonian Monte Carlo (HMC), which is implemented in the Stan software.

Assessing the goodness of fit of the model is an important step in Bayesian regression analysis. This involves comparing the observed data to the data generated from the posterior distribution. If the model fits the data well, the observed data should be similar to the simulated data.
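For instance, assuming a model object fit estimated with rstanarm (the name is a placeholder), the posterior can be summarized and the fit checked against the observed data as follows.

library(rstanarm)

summary(fit)                          # posterior means, standard deviations, and quantiles
posterior_interval(fit, prob = 0.95)  # 95% credible intervals for the parameters

# Posterior predictive check: overlay the observed outcome with data simulated
# from the posterior predictive distribution
pp_check(fit, nreps = 50)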

Markov Chain Monte Carlo (MCMC) Methods

Markov Chain Monte Carlo (MCMC) methods are used to simulate a large number of parameter values from the posterior distribution. MCMC methods are based on the principle of Markov chains, which are stochastic processes where the probability of moving from one state to another depends only on the current state and not on any previous states.

The most common MCMC algorithm used in Bayesian regression analysis is Hamiltonian Monte Carlo (HMC). HMC is a more efficient algorithm than other MCMC methods because it uses gradient information to propose new parameter values, resulting in fewer rejected proposals and faster convergence to the posterior distribution.

In Bayesian regression analysis, MCMC methods are used to estimate the posterior distribution and generate samples of the posterior distribution. The samples can be used to calculate summary statistics, such as the mean, median, standard deviation, and credible intervals, and to make inferences about the model parameters.

It is important to assess the convergence of the MCMC algorithm to ensure that the posterior distribution has been accurately estimated. This can be done by examining trace plots, which show the path of the MCMC algorithm through the parameter space, and by calculating diagnostic statistics, such as the effective sample size and the Gelman-Rubin statistic.

Algorithm of Markov Chain Monte Carlo (MCMC) Methods

The algorithm for MCMC methods involves the following steps:

  1. Choose an initial value for the model parameters.

  2. Generate a proposal for new parameter values using a proposal distribution. The proposal distribution can be chosen based on the characteristics of the data and the model.

  3. Calculate the acceptance probability for the proposed parameter values using the ratio of the posterior probability for the proposed values to the posterior probability for the current values.

  4. Accept the proposal with probability equal to the acceptance probability, otherwise reject the proposal and retain the current parameter values.

  5. Repeat steps 2-4 for a large number of iterations.

  6. Use the samples obtained from the MCMC algorithm to estimate the posterior distribution and make inferences about the model parameters.

In Hamiltonian Monte Carlo (HMC), the proposal distribution is based on the gradient of the log posterior distribution and simulates a trajectory through the parameter space. The HMC algorithm is more efficient than other MCMC methods because it can explore the parameter space more effectively and generates fewer rejected proposals.

It is important to tune the parameters of the MCMC algorithm to ensure efficient mixing and convergence to the posterior distribution. The parameters that need to be tuned include the number of iterations, the burn-in period, the thinning interval, and the proposal distribution.
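To make the steps above concrete, the following toy sketch implements a random-walk Metropolis sampler in base R for the mean of normally distributed data with known standard deviation; it is purely illustrative and is not the HMC sampler used by Stan.

# Toy random-walk Metropolis sampler for the mean 'mu' of normal data (sd = 1)
set.seed(123)
y <- rnorm(50, mean = 2, sd = 1)                 # simulated data

log_posterior <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) + # log-likelihood
    dnorm(mu, mean = 0, sd = 10, log = TRUE)     # vague normal prior
}

n_iter <- 5000
mu_chain <- numeric(n_iter)
mu_chain[1] <- 0                                 # step 1: initial value
for (i in 2:n_iter) {
  proposal  <- rnorm(1, mean = mu_chain[i - 1], sd = 0.5)                # step 2: proposal
  log_ratio <- log_posterior(proposal) - log_posterior(mu_chain[i - 1])  # step 3: acceptance ratio
  if (log(runif(1)) < log_ratio) {               # step 4: accept or reject
    mu_chain[i] <- proposal
  } else {
    mu_chain[i] <- mu_chain[i - 1]
  }
}                                                # step 5: repeat

mean(mu_chain[-(1:1000)])                        # posterior mean after discarding burn-in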

Convergence of MCMC Methods

Convergence is a critical aspect of MCMC methods in Bayesian regression analysis. Convergence refers to the property that the MCMC algorithm has reached a stable state and is generating samples from the target distribution.

Convergence is assessed by examining the trace plots of the MCMC chains, which show the path of the algorithm through the parameter space. Ideally, the trace plot should show a dense and random scatter of points throughout the parameter space. If the trace plot shows a pattern or a lack of randomness, it may indicate that the algorithm has not yet converged.

There are several diagnostic statistics that can be used to assess convergence, including the effective sample size and the Gelman-Rubin statistic.

  1. The effective sample size is a measure of the number of independent samples obtained from the MCMC chain and is used to assess the efficiency of the algorithm.

  2. The Gelman-Rubin statistic compares the variance between different MCMC chains to the variance within each chain and is used to assess the convergence of the algorithm.

If the MCMC algorithm has not converged, it may be necessary to increase the number of iterations or the burn-in period, adjust the proposal distribution, or use a different MCMC algorithm.

In Bayesian regression analysis, it is important to ensure that the MCMC algorithm has converged to the posterior distribution before making inferences about the model parameters. In the next sections, we will demonstrate how to assess convergence of the MCMC algorithm and how to use diagnostic statistics to ensure that the algorithm has converged to the posterior distribution.

Convergence Diagnostic Tests

Convergence diagnostic tests are used to assess whether the Markov Chain Monte Carlo (MCMC) algorithm has converged to the posterior distribution. There are several diagnostic tests that can be used to assess convergence, including:

  1. Trace plots: Trace plots show the path of the MCMC algorithm through the parameter space. If the trace plot shows a dense and random scatter of points throughout the parameter space, it indicates convergence. If the trace plot shows a pattern or lack of randomness, it indicates that the algorithm has not yet converged.

  2. Autocorrelation plots: Autocorrelation plots show the correlation between the values of the MCMC chain at different lags. If the autocorrelation decreases rapidly as the lag increases, it indicates convergence. If the autocorrelation remains high at large lags, it indicates that the algorithm has not yet converged.

  3. Effective sample size: The effective sample size is a measure of the number of independent samples obtained from the MCMC chain. A low effective sample size indicates that the MCMC algorithm is not mixing well and may not have converged.

  4. Gelman-Rubin statistic: The Gelman-Rubin statistic compares the variance between different MCMC chains to the variance within each chain. If the Gelman-Rubin statistic is close to 1, it indicates convergence.

It is important to assess convergence before making inferences about the model parameters. If the MCMC algorithm has not converged, it may be necessary to increase the number of iterations or the burn-in period, adjust the proposal distribution, or use a different MCMC algorithm.

Metrics for Convergence Checking

There are several metrics that can be used to check for convergence of Markov Chain Monte Carlo (MCMC) methods in Bayesian regression analysis. These metrics are used to assess the mixing and stability of the MCMC chains and to ensure that the samples obtained from the chains are representative of the posterior distribution. Some commonly used metrics for convergence checking are:

  1. Effective Sample Size (ESS): The effective sample size is a measure of the number of independent samples obtained from the MCMC chain. A low ESS indicates that the MCMC algorithm is not mixing well and may not have converged. A rule of thumb is to have an ESS of at least 100 for each parameter.

  2. Rhat (Gelman-Rubin statistic): The Rhat statistic compares the variance between different MCMC chains to the variance within each chain. If the Rhat statistic is close to 1, it indicates convergence. A commonly used threshold for convergence is Rhat < 1.1.

  3. Trace plots and autocorrelation plots: Trace plots show the path of the MCMC algorithm through the parameter space, and autocorrelation plots show the correlation between the values of the MCMC chain at different lags. If the trace plot shows a dense and random scatter of points throughout the parameter space, and the autocorrelation decreases rapidly as the lag increases, it indicates convergence.

  4. Geweke diagnostic: The Geweke diagnostic compares the mean of the early and late portions of the MCMC chain to assess whether they come from the same distribution. It is reported as a z-score; values within roughly ±2 (a p-value greater than about 0.05) indicate convergence.

  5. Heidelberger and Welch diagnostic: The Heidelberger and Welch diagnostic tests whether the samples in the MCMC chain come from a stationary distribution, progressively discarding early portions of the chain if needed. If the stationarity test is not rejected (p-value greater than about 0.05), it indicates convergence.

It is important to use multiple metrics for convergence checking to ensure that the MCMC algorithm has accurately estimated the posterior distribution. In the next sections, we will demonstrate how to use these convergence metrics to check for convergence of the MCMC algorithm in Bayesian regression analysis.
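As a minimal sketch, assuming samples is an mcmc.list object with two or more chains (for example, returned by rjags::coda.samples or built with coda::as.mcmc.list), these metrics can be computed with the coda package.

library(coda)

effectiveSize(samples)   # 1. effective sample size for each parameter
gelman.diag(samples)     # 2. Gelman-Rubin statistic (Rhat); values near 1 suggest convergence
traceplot(samples)       # 3. trace plots for each parameter
autocorr.plot(samples)   # 3. autocorrelation plots at increasing lags
geweke.diag(samples)     # 4. Geweke diagnostic (z-scores comparing early vs. late means)
heidel.diag(samples)     # 5. Heidelberger-Welch stationarity and halfwidth tests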

Data Analysis

There are four types of data analysis: descriptive, diagnostic, predictive, and prescriptive analytics. Here’s a brief explanation of each:

Descriptive analytics: This type of analysis focuses on summarizing and describing historical data to gain insights into past trends and events. Descriptive analytics involves collecting, organizing, and presenting data in a meaningful way that allows for easy interpretation and understanding. Examples of descriptive analytics include dashboards, charts, graphs, and tables.

Diagnostic analytics: Diagnostic analytics goes a step further than descriptive analytics by attempting to understand why certain events or patterns occurred in the past. This type of analysis involves digging deeper into data to uncover the root causes of specific outcomes or events. Diagnostic analytics is often used to identify problems or areas of improvement in a business or system.

Predictive analytics: Predictive analytics uses statistical models and algorithms to analyze historical data and make predictions about future events or outcomes. This type of analysis involves identifying patterns and trends in data and using that information to forecast future trends and behaviors. Predictive analytics is often used in industries such as finance, marketing, and healthcare to make informed decisions about future outcomes.

Prescriptive analytics: Prescriptive analytics takes predictive analytics a step further by providing actionable recommendations based on the predictions made by the model. This type of analysis involves using optimization algorithms and other techniques to identify the best course of action to achieve a specific goal. Prescriptive analytics is often used in fields such as healthcare, manufacturing, and logistics to optimize processes and make informed decisions.

Descriptive and Inferential Statistics

Descriptive and inferential statistics are two different branches of statistics that serve different purposes. Here is an illustration to explain the difference between the two:

Suppose we want to study the heights of students in a school.

Descriptive statistics is used to summarize and describe the data that we collected. We could calculate the mean, median, mode, and standard deviation of the heights of all the students in the school. We could also create graphs or charts to visualize the distribution of the heights. Descriptive statistics is useful for providing an overview of the data and identifying patterns and trends.

Inferential statistics, on the other hand, is used to make generalizations or predictions about a larger population based on a sample of data. If we only have data on a sample of students from the school, we could use inferential statistics to estimate the mean height of all students in the school. We could calculate a confidence interval, which is a range of values that we are confident contains the true population mean height. We could also conduct hypothesis testing to determine whether there is a significant difference in height between different groups of students, such as males and females. Inferential statistics allows us to draw conclusions about a population based on a sample of data, but it requires assumptions and statistical tests to ensure accuracy.

In summary, descriptive statistics is used to summarize and describe data, while inferential statistics is used to make generalizations or predictions about a larger population based on a sample of data.

Two Philosophical Schools of Inferential Statistics

The Bayesian and frequentist approaches are two different philosophies for statistical inference. Here’s an illustration to explain the difference between the two:

Suppose we want to estimate the probability of a coin landing heads up. We flip the coin 10 times and observe that it lands heads up 6 times.

The frequentist approach would calculate the probability of getting 6 or more heads in 10 coin flips assuming that the true probability of heads is 0.5, which is the null hypothesis. The p-value is the probability of getting a result as extreme or more extreme than the observed result, assuming the null hypothesis is true. A p-value of less than 0.05 is typically considered evidence to reject the null hypothesis and conclude that the true probability of heads is not 0.5.

The Bayesian approach, on the other hand, would start with a prior belief about the probability of heads, which could be based on previous data or subjective knowledge. The prior belief is updated based on the new data using Bayes’ theorem to obtain a posterior distribution of the probability of heads. This posterior distribution can then be used to make probabilistic statements about the true probability of heads.

For example, if the prior belief was that the probability of heads is equally likely to be any value between 0 and 1, the posterior distribution after observing 6 heads in 10 flips would be a beta distribution with parameters (7, 5), which has a mean of about 0.58 and a mode of 0.6. This means that the Bayesian approach estimates the true probability of heads to be around 0.58–0.6, while also quantifying the plausibility of other values through the shape of the posterior distribution.
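The numbers in this example can be checked directly in R. The short sketch below computes the frequentist exact binomial test and the summaries of the Beta(7, 5) posterior.

# Frequentist: exact binomial test of H0: p = 0.5, given 6 heads in 10 flips
binom.test(6, 10, p = 0.5)

# Bayesian: uniform Beta(1, 1) prior + 6 heads, 4 tails -> Beta(7, 5) posterior
a <- 1 + 6; b <- 1 + 4
a / (a + b)                    # posterior mean ~ 0.583
(a - 1) / (a + b - 2)          # posterior mode = 0.6
qbeta(c(0.025, 0.975), a, b)   # 95% credible interval for the probability of heads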

In summary, the frequentist approach focuses on the probability of observing the data given the null hypothesis, while the Bayesian approach updates prior beliefs to obtain a posterior distribution of the parameter of interest.

Inductive and Deductive Inferential Approaches

Inductive and deductive approaches are two different types of inferential reasoning used in statistical inference. Here are some examples to illustrate the difference between the two:

Inductive Inferential Approach: Inductive reasoning involves drawing general conclusions from specific observations or examples. In statistical inference, inductive reasoning is used to make generalizations about a population based on a sample of data. For example, suppose we survey a random sample of 500 college students and find that 60% of them own a car. We could use inductive reasoning to generalize this result to the entire population of college students in a given country or region, assuming that the sample is representative of the population.

Deductive Inferential Approach: Deductive reasoning involves starting with a general principle or theory and using it to draw specific conclusions. In statistical inference, deductive reasoning is used to test hypotheses and make predictions based on existing theories or models. For example, suppose we have a theory that taller people are more likely to develop certain health problems. We could use deductive reasoning to test this theory by collecting data on the heights and health statuses of a sample of people, and then using statistical tests to determine whether there is a significant relationship between height and health.

In summary, inductive reasoning is used to draw general conclusions from specific observations, while deductive reasoning is used to test hypotheses and make predictions based on existing theories or models. In statistical inference, both approaches are used to draw conclusions about populations based on samples of data or to test hypotheses and make predictions about relationships between variables.

Module I: Regression Models for Continuous Variable Responses

Introduction

Regression models are statistical techniques that are used to model the relationship between a dependent variable and one or more independent variables. In this introduction, we will discuss several regression models that are commonly used for continuous variable responses.

  1. Multiple linear regression model
  2. Polynomial regression model
  3. Ridge regression model
  4. LASSO regression model
  5. Robust regression model
  6. Quantile regression model
  7. Elastic net regression model
  8. Support vector regression model
  9. Principal component regression model
  10. Partial least squares regression model

Multiple Linear Regression Model

Introduction

The first model in the list above is multiple linear regression. Multiple linear regression is a commonly used statistical technique for modeling the relationship between a dependent variable and multiple independent variables. It is a linear model that assumes a linear and additive relationship between the dependent variable and each independent variable.

Model Formulation

The multiple linear regression model can be expressed as:

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]

where \(y\) is the dependent variable, \(x_1, x_2, \ldots, x_p\) are the independent variables, \(\beta_0, \beta_1, \ldots, \beta_p\) are the regression coefficients, and \(\epsilon\) is the error term.

The regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) represent the change in the dependent variable \(y\) associated with a one-unit change in the corresponding independent variable \(x_1, x_2, \ldots, x_p\). The error term \(\epsilon\) represents the unobserved factors that affect the dependent variable and are not explained by the independent variables.

The goal of multiple linear regression is to estimate the regression coefficients that best fit the data. This is typically done using the method of least squares, which minimizes the sum of squared errors between the observed values of the dependent variable and the predicted values from the regression model.

Multiple linear regression is a versatile technique that can be used for a wide range of research questions. It is commonly used in social sciences, economics, engineering, and many other fields to model the relationships between variables. However, it does have some assumptions that must be met, such as linearity, independence, and normality of the errors. If these assumptions are violated, alternative regression models, such as the ones discussed in the next section, may be more appropriate.

Model Assumptions

Before using MLR, it is important to check whether the assumptions of the model are satisfied. The assumptions of the MLR model are:

  1. Linearity: The relationship between the dependent variable and the independent variables should be linear. This means that the relationship between the dependent variable and each independent variable should be a straight line in a scatter plot.

  2. Independence: The observations should be independent of each other. This means that the value of the dependent variable for one observation should not be related to the value of the dependent variable for another observation.

  3. Homoscedasticity: The variance of the errors should be constant across all levels of the independent variables. This means that the spread of the residuals should be the same regardless of the value of the independent variables.

  4. Normality: The errors should be normally distributed. This means that the distribution of the residuals should be a bell-shaped curve with a mean of zero.

  5. No multicollinearity: The independent variables should not be highly correlated with each other. Multicollinearity can affect the stability and reliability of the estimates of the regression coefficients.

  6. No influential outliers: The presence of influential outliers can distort the regression line and affect the estimates of the regression coefficients. Therefore, it is important to check for influential outliers using diagnostic plots and outlier detection techniques.

It is important to note that violating these assumptions can lead to biased or inefficient estimates of the regression coefficients and affect the statistical inference and predictions. Therefore, it is recommended to check these assumptions before interpreting the results of an MLR model.
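As a brief sketch, many of these assumptions can be checked in R with standard diagnostics; the model below uses the built-in mtcars data purely for illustration, and the car and lmtest packages are assumed to be installed.

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

plot(fit)                      # residual plots: linearity, homoscedasticity, influential points
shapiro.test(residuals(fit))   # normality of the residuals
lmtest::bptest(fit)            # Breusch-Pagan test for heteroscedasticity
car::vif(fit)                  # variance inflation factors for multicollinearity
car::durbinWatsonTest(fit)     # Durbin-Watson test for independence of errors
cooks.distance(fit)            # influence measures for detecting influential outliers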

Alternatives to Multiple Linear Regression Model

  1. Polynomial regression model: Polynomial regression becomes an alternative to multiple linear regression when the relationship between the dependent and independent variables is not linear but can be approximated by a polynomial function. It captures curved relationships that a purely linear model cannot.

  2. Ridge regression model: Ridge regression becomes an alternative when the multiple linear regression model suffers from multicollinearity, that is, when two or more independent variables are highly correlated. It reduces the impact of multicollinearity by adding a penalty on the squared magnitude of the coefficients to the regression objective.

  3. LASSO regression model: LASSO regression also addresses multicollinearity, but uses a penalty on the absolute magnitude of the coefficients that can shrink some coefficients exactly to zero. It is an alternative when multicollinearity is present and the researcher is interested in identifying the most important predictors.

  4. Robust regression model: Robust regression becomes an alternative when the data contain outliers that can have a large impact on the results of the multiple linear regression model. Robust regression downweights outlying observations and is therefore less influenced by them.

  5. Quantile regression model: Quantile regression becomes an alternative when the relationship between the dependent and independent variables is not the same across all levels of the dependent variable. It models conditional quantiles (for example, the median) rather than the conditional mean.

  6. Elastic net regression model: Elastic net regression becomes an alternative when there are many independent variables and some of them are highly correlated. It combines the penalty terms of ridge regression and LASSO regression to reduce the impact of multicollinearity while still identifying the most important predictors.

  7. Support vector regression model: Support vector regression becomes an alternative when the relationship between the dependent and independent variables is non-linear and cannot be well approximated by a polynomial function. It uses a kernel function to transform the data into a higher-dimensional space in which an approximately linear relationship can be found.

  8. Principal component regression model: Principal component regression becomes an alternative when there are many highly correlated independent variables that can be combined into a smaller number of uncorrelated components. It uses principal component analysis to construct those components and then regresses the dependent variable on them.

  9. Partial least squares regression model: Partial least squares regression is also an alternative when there are many highly correlated independent variables. Unlike principal component regression, it constructs components that maximize the covariance between the predictors and the dependent variable, so the components are chosen with the response in mind.

Real-life Application

We will use the “Boston Housing” dataset from the MASS package, which contains information about housing values in suburbs of Boston. The goal is to predict the median value of owner-occupied homes (in thousands of dollars) based on 13 predictor variables, including crime rate, proportion of non-retail business acres, and average number of rooms per dwelling.

# Load the MASS package and the Boston Housing dataset
library(MASS)
data(Boston)

# Define the response variable (median value of owner-occupied homes)
y <- Boston$medv

# Define the predictor variables
x <- Boston[, -14]

# Split the data into training and testing sets
set.seed(123)
train_index <- sample(nrow(Boston), nrow(Boston) * 0.7)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]

Frequentist Inference for all Models

# 1. Fit a multiple linear regression model
mlr_model <- lm(medv ~ ., data = train_data)
summary(mlr_model)
## 
## Call:
## lm(formula = medv ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5163  -2.6745  -0.5699   1.5818  24.7767 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.852e+01  6.084e+00   6.332 7.66e-10 ***
## crim        -1.090e-01  3.539e-02  -3.079 0.002245 ** 
## zn           5.303e-02  1.668e-02   3.179 0.001614 ** 
## indus       -5.224e-02  7.877e-02  -0.663 0.507669    
## chas         4.044e+00  1.025e+00   3.946 9.64e-05 ***
## nox         -1.443e+01  4.671e+00  -3.089 0.002171 ** 
## rm           3.178e+00  4.993e-01   6.365 6.32e-10 ***
## age         -5.659e-04  1.618e-02  -0.035 0.972128    
## dis         -1.541e+00  2.405e-01  -6.406 4.98e-10 ***
## rad          3.023e-01  8.064e-02   3.749 0.000209 ***
## tax         -1.049e-02  4.658e-03  -2.252 0.024963 *  
## ptratio     -8.587e-01  1.599e-01  -5.370 1.46e-07 ***
## black        6.865e-03  3.443e-03   1.994 0.046977 *  
## lstat       -5.838e-01  5.915e-02  -9.871  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.787 on 340 degrees of freedom
## Multiple R-squared:  0.733,  Adjusted R-squared:  0.7228 
## F-statistic:  71.8 on 13 and 340 DF,  p-value: < 2.2e-16
# 2. Fit a polynomial regression model
poly_model <- lm(medv ~ poly(lstat, 2) + poly(rm, 2), data = train_data)
summary(poly_model)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 2) + poly(rm, 2), data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8973  -2.7986  -0.5529   2.0765  27.4996 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.3011     0.2368  94.166  < 2e-16 ***
## poly(lstat, 2)1 -98.8748     5.5940 -17.675  < 2e-16 ***
## poly(lstat, 2)2  22.3699     5.0154   4.460 1.11e-05 ***
## poly(rm, 2)1     51.6293     5.6290   9.172  < 2e-16 ***
## poly(rm, 2)2     51.3146     4.9761  10.312  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.456 on 349 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.7598 
## F-statistic: 280.2 on 4 and 349 DF,  p-value: < 2.2e-16
# 3. Fit a ridge regression model
ridge_model <- MASS::lm.ridge(medv ~ ., data = train_data, lambda = 0.1)
summary(ridge_model)
##        Length Class  Mode   
## coef   13     -none- numeric
## scales 13     -none- numeric
## Inter   1     -none- numeric
## lambda  1     -none- numeric
## ym      1     -none- numeric
## xm     13     -none- numeric
## GCV     1     -none- numeric
## kHKB    1     -none- numeric
## kLW     1     -none- numeric
# 4. Fit a LASSO regression model
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.1.3
## Loaded glmnet 4.1-4
lasso_model <- glmnet(x = as.matrix(train_data[, -14]), y = train_data$medv, alpha = 1)
summary(lasso_model)
##           Length Class     Mode   
## a0         74    -none-    numeric
## beta      962    dgCMatrix S4     
## df         74    -none-    numeric
## dim         2    -none-    numeric
## lambda     74    -none-    numeric
## dev.ratio  74    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        4    -none-    call   
## nobs        1    -none-    numeric
# 5. Fit a robust regression model
library(MASS)
robust_model <- rlm(medv ~ ., data = train_data, psi = psi.hampel)
summary(robust_model)
## 
## Call: rlm(formula = medv ~ ., data = train_data, psi = psi.hampel)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5482 -1.9878 -0.2963  2.0122 33.3536 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept) 17.2828  4.5272     3.8175
## crim        -0.0891  0.0263    -3.3832
## zn           0.0348  0.0124     2.8003
## indus       -0.0333  0.0586    -0.5678
## chas         2.4038  0.7626     3.1523
## nox         -7.5058  3.4756    -2.1595
## rm           5.2540  0.3716    14.1394
## age         -0.0256  0.0120    -2.1229
## dis         -1.1153  0.1790    -6.2314
## rad          0.1528  0.0600     2.5465
## tax         -0.0111  0.0035    -3.2053
## ptratio     -0.7759  0.1190    -6.5207
## black        0.0095  0.0026     3.7138
## lstat       -0.3370  0.0440    -7.6554
## 
## Residual standard error: 2.962 on 340 degrees of freedom
# 6. Fit a quantile regression model
library(quantreg)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
quantile_model <- rq(medv ~ ., data = train_data, tau = 0.5)
summary(quantile_model)
## Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
## nonunique
## 
## Call: rq(formula = medv ~ ., tau = 0.5, data = train_data)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 23.62653      8.99790 34.09720
## crim        -0.13353     -0.14419 -0.03691
## zn           0.04243      0.01556  0.05557
## indus       -0.06134     -0.10983  0.04349
## chas         2.14716      1.15965  3.69222
## nox         -5.26155     -8.42533  0.43382
## rm           3.77516      2.17125  5.47205
## age         -0.01590     -0.03364  0.00596
## dis         -1.17924     -1.43869 -0.79739
## rad          0.16148      0.10585  0.27347
## tax         -0.00952     -0.01651 -0.00584
## ptratio     -0.64128     -0.81483 -0.48623
## black        0.00843      0.00520  0.01474
## lstat       -0.45333     -0.56516 -0.27059
# 7. Fit an elastic net regression model
enet_model <- glmnet(x = as.matrix(train_data[, -14]), y = train_data$medv, alpha = 0.5)
summary(enet_model)
##           Length Class     Mode   
## a0          77   -none-    numeric
## beta      1001   dgCMatrix S4     
## df          77   -none-    numeric
## dim          2   -none-    numeric
## lambda      77   -none-    numeric
## dev.ratio   77   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         4   -none-    call   
## nobs         1   -none-    numeric
# 8. Fit a support vector regression model
library(e1071)
svm_model <- svm(medv ~ ., data = train_data, kernel = "radial", cost = 100, epsilon = 0.1)
summary(svm_model)
## 
## Call:
## svm(formula = medv ~ ., data = train_data, kernel = "radial", cost = 100, 
##     epsilon = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  100 
##       gamma:  0.07692308 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  251
# 9. Fit a principal component regression model
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr_model <- pcr(medv ~ ., data = train_data, ncomp = 5, scale = TRUE)
summary(pcr_model)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       48.02    59.39    68.53    75.20    81.14
## medv    36.41    45.73    61.70    65.88    68.00
# 10. Fit a partial least squares regression model
pls_model <- plsr(medv ~ ., data = train_data, ncomp = 5, scale = TRUE)
summary(pls_model)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       46.78    57.68    65.47    70.81    76.35
## medv    48.40    69.12    71.21    72.58    73.05

Note that the exact parameter choices and model specifications may vary depending on the specific research question and dataset being analyzed. Additionally, it is important to evaluate the performance of each model on the testing set to determine which model provides the best predictions for new data.
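As one possible sketch of such an evaluation, the code below computes the test-set root mean squared error (RMSE) for a few of the fitted models; the lambda value used for the LASSO prediction is chosen only for illustration.

# Test-set RMSE for selected models fitted above
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(test_data$medv, predict(mlr_model, newdata = test_data))     # multiple linear regression
rmse(test_data$medv, predict(poly_model, newdata = test_data))    # polynomial regression
rmse(test_data$medv, predict(robust_model, newdata = test_data))  # robust regression
rmse(test_data$medv, predict(svm_model, newdata = test_data))     # support vector regression
rmse(test_data$medv,
     as.numeric(predict(lasso_model, newx = as.matrix(test_data[, -14]), s = 0.1)))  # LASSO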

Module II: Regression Models for Count (Discrete) Variable Responses

Introduction

Module II of our course covers regression models for discrete variable responses, with a focus on count data. In this module, we explored 10 different count regression models that can be used to analyze data with discrete, non-negative count outcomes. These models include:

  1. Poisson regression
  2. Negative binomial regression
  3. Quasi-Poisson regression
  4. Geometric regression
  5. Zero-inflated Poisson regression
  6. Zero-inflated negative binomial regression
  7. Zero-truncated Poisson regression
  8. Zero-truncated negative binomial regression
  9. Hurdle Poisson regression
  10. Hurdle negative binomial regression

Each of these models has its own strengths and weaknesses, and is appropriate for different types of count data. For example, Poisson regression is often used when the mean and variance of the count data are approximately equal, while negative binomial regression is useful when there is overdispersion in the data (i.e., the variance is greater than the mean). The zero-inflated and hurdle models are designed to handle situations where there are excess zeros in the data, which can be caused by a variety of factors such as measurement error or the presence of a sub-population with a different underlying distribution.

Overall, this module provides a comprehensive overview of the different count regression models that are available for analyzing count data, and will be useful for anyone working with this type of data in their research or professional work.

Poisson Regression

Introduction:

Poisson regression is a type of regression analysis that is used to model count data. It is based on the Poisson distribution, which is a probability distribution that describes the number of times an event occurs in a fixed interval of time or space. Poisson regression is commonly used in fields such as epidemiology, biology, and economics, where the outcome variable is a count of events.

Model Formulation:

The Poisson regression model assumes that the dependent variable \(Y\) follows a Poisson distribution with a mean parameter \(\lambda\). The model assumes that the log of the mean parameter is a linear function of the independent variables \(X\), and is given by the following equation:

\[ \ln(\lambda) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \]

where \(\beta_0\) is the intercept, \(\beta_1, \beta_2, \ldots, \beta_k\) are the coefficients of the independent variables \(X_1, X_2, \ldots, X_k\), respectively.

Model Assumptions:

The assumptions of the Poisson regression model are:

  • Independence: The observations should be independent of each other. This means that the occurrence of an event for one unit should not affect the occurrence of an event for another unit.

  • Count data: The dependent variable should be count data, which means that it should be a discrete variable that takes non-negative integer values.

  • Linearity: The relationship between the logarithm of the mean parameter and the independent variables should be linear. This means that a one-unit change in an independent variable multiplies the mean of the dependent variable by a constant factor (the effect is linear on the log scale).

  • Homogeneity of variance: The variance of the dependent variable should be equal to its mean, which is the defining property of the Poisson distribution. This assumption is referred to as equidispersion. If the variance is greater than the mean (overdispersion) or less than the mean (underdispersion), then the Poisson regression model may not be appropriate.

  • No multicollinearity: The independent variables should not be highly correlated with each other. If there is multicollinearity among the independent variables, then the coefficients of the independent variables may be unstable and difficult to interpret.

  • Lack of influential outliers: Outliers or extreme values in the independent variables or the dependent variable can affect the results of the Poisson regression model. Therefore, it is important to check for influential outliers and consider their impact on the results.

Application to Real-life Data:

Poisson regression can be applied to a wide range of real-life data, such as the number of deaths due to a particular disease, the number of accidents on a particular road, or the number of customers visiting a store on a particular day. For example, a Poisson regression model can be used to analyze the factors that affect the number of accidents on a particular road, such as traffic volume, weather conditions, and road design.
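As a minimal sketch, a Poisson regression can be fitted to the built-in warpbreaks data (counts of yarn breaks per loom by wool type and tension) with glm(); a Bayesian version using rstanarm (assumed installed, with its default priors) is shown for comparison.

data(warpbreaks)

# Frequentist Poisson regression with a log link
pois_glm <- glm(breaks ~ wool + tension, family = poisson(link = "log"),
                data = warpbreaks)
summary(pois_glm)

# Bayesian Poisson regression via rstanarm
library(rstanarm)
pois_bayes <- stan_glm(breaks ~ wool + tension, family = poisson(link = "log"),
                       data = warpbreaks, chains = 2, iter = 2000, seed = 123)
summary(pois_bayes)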

Limitations:

Although Poisson regression is a powerful statistical model for analyzing count data, it has some limitations that should be considered. Here are some of the limitations of the Poisson regression model:

  • Overdispersion: The Poisson regression model assumes that the variance of the dependent variable is equal to its mean, which is called equidispersion. However, in some cases, the variance may be larger than the mean, which is called overdispersion. Overdispersion can lead to biased parameter estimates and incorrect inferences.

  • Zero Inflation: The Poisson regression model assumes that the probability of a count being zero is low. However, in some cases, there may be an excess number of zeros in the data, which is called zero inflation. Poisson regression may not be appropriate for such data, and alternative models such as zero-inflated Poisson regression or hurdle models may be more appropriate.

  • Nonlinear relationships: The Poisson regression model assumes that the relationship between the dependent variable and the independent variables is linear. If the relationship is nonlinear, then the Poisson regression model may not capture the true relationship between the variables.

  • Outliers: The Poisson regression model is sensitive to outliers, which can affect the estimates of the regression coefficients and the goodness-of-fit of the model.

  • Independence: The Poisson regression model assumes that the observations are independent of each other. However, in some cases, the observations may be correlated, which can violate the independence assumption of the model.

  • Model Selection: Poisson regression models can be complex, and there may be several models with different independent variables that can fit the data equally well. Choosing the best model can be challenging and requires careful consideration of the model assumptions, goodness-of-fit, and predictive accuracy.

  • Large counts: The Poisson regression model may not be appropriate for count data with very large counts, especially if the sample size is small. Large counts are often accompanied by extra variability, so the equidispersion assumption becomes harder to sustain, and alternative models such as negative binomial regression may be more appropriate.

  • Continuous independent variables: Poisson regression accommodates continuous independent variables through the log link, but it assumes that their effect on the log of the mean is linear. If this effect is non-linear, transformations of the independent variables or more flexible models such as generalized additive models may be necessary.

  • Missing data: The Poisson regression model may not be appropriate for count data with missing values, as the missing values can affect the estimates of the regression coefficients and the goodness-of-fit of the model. In such cases, methods such as multiple imputation or maximum likelihood estimation may be used to handle missing data.

  • Model assumptions: The Poisson regression model assumes that the data follows a Poisson distribution and that the relationship between the dependent variable and the independent variables is linear. Violations of these assumptions can lead to biased parameter estimates and incorrect inferences. Therefore, it is important to assess the goodness-of-fit of the model and to consider alternative models if the assumptions are not met.

  • Zero truncation: Zero truncation occurs when the data contains only positive counts, and there are no zeros. This can happen when the data is collected from a population that is guaranteed to produce at least one count. The Poisson regression model assigns positive probability to a count of zero, and therefore may not be appropriate for zero-truncated data. In such cases, zero-truncated Poisson regression or zero-truncated negative binomial regression may be more appropriate.

  • Zero asymmetry: Zero asymmetry occurs when there are more zeros in the data than would be expected under a Poisson distribution. This can happen when the data contains a mixture of two populations, one with a high probability of producing zero counts and another with a Poisson distribution for the positive counts. The Poisson regression model assumes that the probability of a count being zero is low, and therefore may not be appropriate for zero-asymmetric data. In such cases, zero-inflated Poisson regression or hurdle models may be more appropriate.

In summary, while the Poisson regression model is a powerful tool for analyzing count data, it has several limitations that should be considered. Choosing the appropriate model requires careful consideration of the characteristics of the data and the assumptions of the model.

Alternative Models

  • Zero-truncation: If the count data has zero truncation (data contains only positive counts, and there are no zeros), Zero-truncated Poisson regression (or Zero-truncated Negative binomial regression) is an alternative to Poisson regression when the data contains only positive counts, and there are no zeros. This model assumes that the data follows a Poisson distribution (or negative binomial), but with the probability of a count being zero removed from the distribution.

  • Zero asymmetry: If the count data has more zeros than would be expected under a Poisson distribution, then Poisson regression may not be appropriate. Geometric regression can be an alternative if the data follows a geometric distribution, where the probability of observing the first success (i.e., non-zero count) follows a geometric distribution.

  • Overdispersion: If the variance of the count data is greater than the mean, then the assumption of equidispersion in Poisson regression is violated. Negative binomial regression is often used as an alternative in such situations. However, if the count data follows a geometric distribution, then geometric regression may be a suitable alternative as it can handle overdispersion.

  • Underdispersion: If the variance of the count data is less than the mean, then the assumption of equidispersion in Poisson regression is violated. Quasi-Poisson regression is often used as an alternative in such situations, since it estimates a dispersion parameter that allows the variance to be smaller (or larger) than the mean.

  • Zero Inflation: Zero-inflated Poisson regression is an alternative to Poisson regression when there is zero inflation in the data, meaning that there are more zeros than expected under a Poisson distribution. This model is a mixture of two distributions, a Poisson distribution for the non-zero counts and a point mass at zero for the excess zeros.

  • Zero Inflation: Hurdle models are another alternative to Poisson regression when there is zero inflation in the data. This model is a two-part model, with a logistic regression model predicting the probability of a count being zero (the hurdle) and a truncated Poisson or negative binomial model for the positive counts.

Summary

  1. Negative binomial regression: Negative binomial regression becomes an alternative to Poisson regression when there is overdispersion in the data, which means that the variance is greater than the mean.

  2. Quasi-Poisson regression: Quasi-Poisson regression becomes an alternative to Poisson regression when there is overdispersion in the data, but the negative binomial distribution is not appropriate, or when there is underdispersion in the data.

  3. Geometric regression: Geometric regression becomes an alternative to Poisson regression when the data contain only non-zero counts and the dependent variable follows a geometric distribution.

  4. Zero-inflated Poisson regression: Zero-inflated Poisson regression becomes an alternative to Poisson regression when the data contains excess zeros, which means that there are more zeros in the data than would be expected from a Poisson distribution.

  5. Zero-inflated negative binomial regression: Zero-inflated negative binomial regression becomes an alternative to Poisson regression when the data contains excess zeros and overdispersion.

  6. Zero-truncated Poisson regression: Zero-truncated Poisson regression becomes an alternative to Poisson regression when the data contain only non-zero counts and the counts are equidispersed (the variance is approximately equal to the mean).

  7. Zero-truncated negative binomial regression: Zero-truncated negative binomial regression becomes an alternative to Poisson regression when the data contain only non-zero counts and the variance is greater than the mean (overdispersion).

  8. Hurdle Poisson regression: Hurdle Poisson regression becomes an alternative to Poisson regression when the data contain excess zeros but the positive counts show no overdispersion.

  9. Hurdle negative binomial regression: Hurdle negative binomial regression becomes an alternative to Poisson regression when the data contain excess zeros and the positive counts are overdispersed.

Overall, these different regression models become alternatives to Poisson regression in various situations where the assumptions of Poisson regression are not met or where the data has certain characteristics such as excess zeros, overdispersion, or only non-zero counts. Understanding the appropriate regression model to use for a given data set is important for obtaining accurate and reliable predictions and insights.
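Before settling on one of these alternatives, it is often worth checking the dispersion and the zero frequency directly. The sketch below uses simulated data; y and fit are placeholder objects rather than data used elsewhere in this primer, and the AER package call is mentioned only as an optional formal test.

# Quick diagnostics sketch on simulated counts (placeholders, not primer data)
set.seed(123)
y   <- rnbinom(500, size = 1, mu = 2)             # hypothetical overdispersed counts
fit <- glm(y ~ 1, family = poisson())
c(mean = mean(y), variance = var(y))              # variance much larger than the mean: overdispersion
c(observed_zeros = mean(y == 0),                  # observed share of zeros
  expected_zeros = mean(dpois(0, fitted(fit))))   # share expected under the fitted Poisson
# A formal score test for overdispersion: AER::dispersiontest(fit)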

10 Count Regression Models

Poisson Regression Model:

  • The Poisson regression model is used when the count data have an equal mean and variance (equidispersion). The Poisson regression model assumes that the count data follow a Poisson distribution and models the expected count as a function of one or more predictor variables.

  • Example: A researcher wants to model the number of traffic accidents per day on a busy road. The data show that the number of accidents per day is relatively stable over time and across different weather conditions.

Negative Binomial Regression Model:

  • The Negative Binomial regression model is used when the count data have overdispersion, meaning that the variance is greater than the mean. The Negative Binomial regression model assumes that the count data follow a Negative Binomial distribution and models the expected count as a function of one or more predictor variables.

  • Example: A researcher wants to model the number of books read per month by a book club. The data show that some members read many more books than others, resulting in overdispersion.

Geometric Regression Model:

  • The Geometric regression model is used when the probability of each successive count decreases geometrically, so that small counts are the most likely values. The Geometric regression model assumes that the count data follow a Geometric distribution and models the expected count as a function of one or more predictor variables (a code sketch follows the example below).

  • Example: A researcher wants to model the number of attempts it takes for a student to pass a difficult exam. The data show that some students pass on their first attempt, but most require multiple attempts.
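The geometric model is not refitted in the worked example later in this module, so a minimal sketch is given here. It relies on the fact that the geometric distribution is a negative binomial with dispersion parameter theta = 1, which MASS::negative.binomial() can supply as a fixed-theta family; the data frame and variable names below are hypothetical.

# Geometric regression as a negative binomial GLM with theta fixed at 1
library(MASS)
set.seed(1)
exam_data <- data.frame(study_hours = runif(200, 0, 20))        # hypothetical data
exam_data$attempts <- rnbinom(200, size = 1,
                              mu = exp(1.5 - 0.05 * exam_data$study_hours))
geo_fit <- glm(attempts ~ study_hours, data = exam_data,
               family = negative.binomial(theta = 1, link = "log"))
summary(geo_fit)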

Quasi-Poisson Regression Model:

  • The Quasi-Poisson regression model is used when the count data show underdispersion (or overdispersion) and the Negative Binomial regression model cannot be used because of convergence problems or other issues. The Quasi-Poisson model keeps the Poisson mean structure but estimates a dispersion parameter, allowing the variance to be larger or smaller than the mean (a code sketch follows the example below).

  • Example: A researcher wants to model the number of visits per day to a website. The data show a lot of variability in the number of visits per day, but the Poisson and Negative Binomial models do not provide a good fit.
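A minimal quasi-Poisson sketch follows; the website-visit data are simulated and the variable names are hypothetical. family = quasipoisson() keeps the Poisson mean structure but estimates a dispersion parameter, so the variance may be larger or smaller than the mean.

# Quasi-Poisson sketch on simulated, overdispersed visit counts
set.seed(42)
site_data <- data.frame(weekday = factor(rep(1:7, each = 30)))
site_data$visits <- rnbinom(210, size = 20, mu = 40)     # hypothetical daily visits
qp_fit <- glm(visits ~ weekday, data = site_data, family = quasipoisson())
summary(qp_fit)$dispersion   # values far from 1 signal over- or underdispersion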

Zero-Inflated Poisson (ZIP) Model:

  • The ZIP model assumes that the count data follow a Poisson distribution but includes an additional component to account for the excess zeros. The model consists of two parts: a binary component that models the occurrence of zeros and a Poisson component that models the non-zero counts. The binary component can be modeled using a logistic regression model, and the Poisson component can be modeled using a Poisson regression model.

  • Example: A researcher wants to model the number of medical appointments per month for patients with a chronic illness. The data show an excess of patients who did not schedule any appointments in a given month.

Zero-Truncated Poisson (ZTP) Model:

  • The ZTP model assumes that the count data follow a Poisson distribution restricted to positive values. It is used when zeros cannot appear in the sample by design, for example when only units that experienced at least one event are recorded, rather than because of excess zeros. In practice the ZTP model is sometimes approximated by fitting an ordinary Poisson regression to the non-zero counts only, although a fit based on the zero-truncated likelihood is more accurate (a sketch follows the example below).

  • Example: A researcher wants to model the number of fish caught per fishing trip using a registry that records only trips on which at least one fish was caught, so zero catches cannot appear in the sample.
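The worked example later in this module approximates the ZTP model by dropping the zeros and fitting an ordinary Poisson regression. A fit that uses the zero-truncated likelihood itself is sketched below with VGAM's pospoisson() family; the fishing-trip data and variable names are simulated and hypothetical.

# Zero-truncated Poisson via the truncated likelihood (VGAM), on simulated data
library(VGAM)
set.seed(2)
trip_data <- data.frame(hours = runif(150, 1, 8))
lam <- exp(0.2 + 0.15 * trip_data$hours)
trip_data$catch <- qpois(runif(150, ppois(0, lam), 1), lam)   # draws restricted to counts >= 1
ztp_fit <- vglm(catch ~ hours, family = pospoisson(), data = trip_data)
summary(ztp_fit)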

Hurdle Poisson Model:

  • The hurdle model is a two-part model that is similar to the ZIP model but uses a different distribution for the non-zero counts. The hurdle model consists of a binary component that models the occurrence of zeros and a truncated Poisson or negative binomial component that models the non-zero counts.

  • Example: A researcher wants to model the number of car accidents per day on a busy intersection. The data show an excess of days with no accidents and a right-skewed distribution of the non-zero counts.

Zero-Inflated Negative Binomial (ZINB) Model:

  • The ZINB model is a generalization of the ZIP model that allows for overdispersion in the count data. The ZINB model consists of two parts: a binary component that models the occurrence of zeros and a negative binomial component that models the non-zero counts. The binary component can be modeled using a logistic regression model, and the negative binomial component can be modeled using a negative binomial regression model.

  • Example: A researcher wants to model the number of accidents per month in a city. The data show an excess of months with no accidents and overdispersion in the non-zero counts.

Zero-Truncated Negative Binomial (ZTNB) Model:

  • The ZTNB model is a generalization of the ZTP model that allows for overdispersion in the count data. The ZTNB model assumes that the count data follow a negative binomial distribution restricted to positive values. In practice it is sometimes approximated by fitting an ordinary negative binomial regression to the non-zero counts only, although a fit based on the zero-truncated likelihood is more accurate.

  • Example: A researcher wants to model the number of books read per month by a book club. The data show that members who did not read any books in a given month did not attend the book club meeting.

Hurdle Negative Binomial Model:

  • The hurdle model is a two-part model that is similar to the ZIP model but uses a different distribution for the non-zero counts. The hurdle model consists of a binary component that models the occurrence of zeros and a truncated negative binomial component that models the non-zero counts.

  • Example: A researcher wants to model the number of visits per month to a website. The data show an excess of visitors who did not visit the website in a given month and a right-skewed distribution of the non-zero counts.
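Because the focus of this primer is Bayesian estimation, it is worth noting that the zero-modified models above have direct Bayesian counterparts in brms (which compiles to Stan). The sketch below uses simulated data, hypothetical variable names, and brms default priors, and it fits only the zero-inflated Poisson model; the hurdle variants are indicated in comments because each brm() call compiles and samples a Stan model.

# Bayesian zero-modified count models via brms/Stan (simulated, hypothetical data)
library(brms)
set.seed(3)
d <- data.frame(x = rnorm(300), z = rbinom(300, 1, 0.5))
d$y <- ifelse(rbinom(300, 1, plogis(-1 + d$z)), 0,          # structural zeros driven by z
              rpois(300, exp(0.5 + 0.4 * d$x)))             # Poisson counts driven by x
zip_fit <- brm(bf(y ~ x, zi ~ z), family = zero_inflated_poisson(), data = d)
summary(zip_fit)
# Hurdle analogues swap zi for hu and use a truncated count distribution:
#   brm(bf(y ~ x, hu ~ z), family = hurdle_poisson(), data = d)
#   brm(bf(y ~ x, hu ~ z), family = hurdle_negbinomial(), data = d)
# and zero_inflated_negbinomial() covers the ZINB case.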

Real-Life Example - Drafting a Research Paper

A researcher wants to draft a paper titled “Risk Factors of Under-Five Mortality in Togdheer: Application of Count Regression Models”, using evidence from the 2020 SLDHS data.

Introduction

Under-five mortality refers to the death of children under the age of five years. Despite the significant progress that has been made in reducing under-five mortality globally, the burden of child mortality remains high in many developing countries, including Togdheer. Togdheer is one of the regions in Somaliland that has been severely affected by conflict, drought, and poverty, leading to poor health outcomes, including high under-five mortality rates.

Understanding the risk factors associated with under-five mortality in Togdheer is crucial for developing effective strategies to reduce child mortality. The 2020 Somaliland Demographic and Health Survey (SLDHS) provides a unique opportunity to investigate the risk factors associated with under-five mortality in Togdheer.

This study aims to investigate the risk factors of under-five mortality in Togdheer, using count regression models. The study will utilize the SLDHS 2020 data to identify the factors associated with under-five mortality in Togdheer and estimate the magnitude of their effects.

The findings of this study will contribute to the existing literature on child mortality and provide important insights for policymakers and stakeholders in Togdheer to develop and implement effective interventions to reduce under-five mortality. Additionally, the study will provide evidence-based recommendations for improving child health outcomes in Togdheer and other similar settings.

Materials and Methods

Study Design and Data Source:

This study utilized a cross-sectional study design and analyzed data from the 2020 Somaliland Demographic and Health Survey (SLDHS). The SLDHS is a nationally representative survey that provides data on health indicators, including under-five mortality, maternal and child health, and other relevant demographic and socioeconomic variables.

Study Population:

The study population consisted of children under the age of five years who were born to women aged 15-49 years and were living in Togdheer region at the time of the survey.

Data Collection:

The SLDHS used a stratified two-stage cluster sampling design to select a representative sample of households. In the first stage, clusters (enumeration areas) were selected using probability proportional to size sampling. In the second stage, households were selected using systematic sampling. Trained interviewers collected data using standardized questionnaires, including the Household Questionnaire, Women’s Questionnaire, and Under-Five Questionnaire.

Variables:

The outcome variable for this study was under-five mortality, measured as the number of deaths among children under the age of five years recorded for each respondent (a count outcome). The main independent variables were maternal and child demographic and socioeconomic characteristics, including maternal age, education, occupation, household wealth index, and access to health care services, among others.

Statistical Analysis:

Descriptive statistics were used to summarize the characteristics of the study population and the prevalence of under-five mortality. Count regression models, including Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, hurdle Poisson, and hurdle negative binomial regression models, were used to identify the risk factors associated with under-five mortality and to estimate the magnitude of their effects.

Ethical Considerations:

The SLDHS was approved by the Somali Ministry of Health and implemented by the National Statistics Directorate. Informed consent was obtained from all participants, and all data were kept confidential and anonymous. This study was approved by the Ethics Review Committee of the University of Togdheer.

Results

Inferential Analysis

# Load library
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(MASS)
library(actuar)
## Warning: package 'actuar' was built under R version 4.1.3
## 
## Attaching package: 'actuar'
## The following objects are masked from 'package:stats':
## 
##     sd, var
## The following object is masked from 'package:grDevices':
## 
##     cm
# Load data
library(haven)
## Warning: package 'haven' was built under R version 4.1.3
data <- read_dta("Togdheer2.dta")
data<-data[data$region==3,]
data$childsex<-factor(data$childsex)
data$childtwin<-factor(data$childtwin)
data$Residence<-factor(data$Residence)
data$breastfeeding<-factor(data$breastfeeding)
data$Education<-factor(data$Education)
data$toiletfacility<-factor(data$toiletfacility)
data$watersource<-factor(data$watersource)
data$wealthindex<-factor(data$wealthindex)
data$maritalstatus<-factor(data$maritalstatus)
# Fit Poisson model
poisson_model<-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data, family = poisson())
# Summarize results
summary(poisson_model)
## 
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + 
##     birthorder + breastfeeding + mothersageatfirstbirth + Residence + 
##     Education, family = poisson(), data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.94809  -0.99148   0.02731   0.47494   1.55367  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.338e+00  2.365e-01  -9.885  < 2e-16 ***
## childsex2              -9.819e-02  6.186e-02  -1.587 0.112465    
## childtwin2              7.330e-03  1.915e-01   0.038 0.969471    
## childreneverborn        1.030e-01  1.316e-02   7.822 5.19e-15 ***
## watersource2            2.566e-01  7.694e-02   3.335 0.000852 ***
## toiletfacility2        -3.069e-01  1.008e-01  -3.046 0.002322 ** 
## wealthindex2            6.405e-01  1.471e-01   4.354 1.34e-05 ***
## wealthindex3            4.012e-01  1.450e-01   2.767 0.005664 ** 
## wealthindex4            5.634e-01  1.377e-01   4.091 4.30e-05 ***
## wealthindex5            5.010e-01  1.405e-01   3.567 0.000362 ***
## maritalstatus2         -1.377e-01  1.133e-01  -1.215 0.224385    
## maritalstatus3         -3.958e-01  1.487e-01  -2.662 0.007758 ** 
## birthorder             -8.997e-06  1.175e-02  -0.001 0.999389    
## breastfeeding2          6.093e-01  9.928e-02   6.137 8.43e-10 ***
## mothersageatfirstbirth  3.417e-02  7.752e-03   4.408 1.04e-05 ***
## Residence2             -1.018e-01  1.041e-01  -0.978 0.328024    
## Residence3             -1.163e-02  9.838e-02  -0.118 0.905877    
## Education2              1.766e-02  1.139e-01   0.155 0.876753    
## Education3             -2.737e-01  2.346e-01  -1.166 0.243415    
## Education4              3.973e-01  4.652e-01   0.854 0.392997    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1054.56  on 1004  degrees of freedom
## Residual deviance:  778.08  on  985  degrees of freedom
## AIC: 2417.4
## 
## Number of Fisher Scoring iterations: 5
# Fit Negative Binomial model
nb_model <- glm.nb(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Summarize results
summary(nb_model)
## 
## Call:
## glm.nb(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + 
##     birthorder + breastfeeding + mothersageatfirstbirth + Residence + 
##     Education, data = data, init.theta = 24126.54212, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9481  -0.9915   0.0273   0.4749   1.5536  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.338e+00  2.365e-01  -9.884  < 2e-16 ***
## childsex2              -9.819e-02  6.187e-02  -1.587 0.112469    
## childtwin2              7.332e-03  1.915e-01   0.038 0.969462    
## childreneverborn        1.030e-01  1.316e-02   7.822 5.21e-15 ***
## watersource2            2.566e-01  7.694e-02   3.335 0.000853 ***
## toiletfacility2        -3.069e-01  1.008e-01  -3.046 0.002322 ** 
## wealthindex2            6.405e-01  1.471e-01   4.354 1.34e-05 ***
## wealthindex3            4.013e-01  1.450e-01   2.767 0.005663 ** 
## wealthindex4            5.635e-01  1.377e-01   4.091 4.30e-05 ***
## wealthindex5            5.010e-01  1.405e-01   3.567 0.000362 ***
## maritalstatus2         -1.377e-01  1.133e-01  -1.215 0.224399    
## maritalstatus3         -3.958e-01  1.487e-01  -2.662 0.007760 ** 
## birthorder             -8.966e-06  1.175e-02  -0.001 0.999391    
## breastfeeding2          6.093e-01  9.929e-02   6.136 8.44e-10 ***
## mothersageatfirstbirth  3.417e-02  7.752e-03   4.408 1.04e-05 ***
## Residence2             -1.018e-01  1.041e-01  -0.978 0.328041    
## Residence3             -1.164e-02  9.838e-02  -0.118 0.905821    
## Education2              1.766e-02  1.139e-01   0.155 0.876721    
## Education3             -2.737e-01  2.346e-01  -1.166 0.243415    
## Education4              3.973e-01  4.652e-01   0.854 0.393011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(24126.54) family taken to be 1)
## 
##     Null deviance: 1054.52  on 1004  degrees of freedom
## Residual deviance:  778.06  on  985  degrees of freedom
## AIC: 2419.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  24127 
##           Std. Err.:  96258 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -2377.401
# Fit Quasi-Poisson model
qp_model <- glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data, family = quasi(link = "log", variance = "mu^2"))
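# Note: quasi(link = "log", variance = "mu^2") assumes the variance is proportional
# to the squared mean; the conventional quasi-Poisson specification is
# family = quasipoisson(), in which the variance is proportional to the mean.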

# Summarize results
summary(qp_model)
## 
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + 
##     birthorder + breastfeeding + mothersageatfirstbirth + Residence + 
##     Education, family = quasi(link = "log", variance = "mu^2"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5645   0.0000   0.0000   0.4933   2.4611  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.2181763  0.2076058 -10.685  < 2e-16 ***
## childsex2              -0.1843738  0.0582980  -3.163 0.001611 ** 
## childtwin2              0.0821773  0.2083870   0.394 0.693409    
## childreneverborn        0.0933266  0.0127406   7.325 4.96e-13 ***
## watersource2            0.3368912  0.0675941   4.984 7.35e-07 ***
## toiletfacility2        -0.3728549  0.0925331  -4.029 6.02e-05 ***
## wealthindex2            0.7159793  0.1214427   5.896 5.12e-09 ***
## wealthindex3            0.7812116  0.1196540   6.529 1.06e-10 ***
## wealthindex4            0.7747426  0.1083647   7.149 1.70e-12 ***
## wealthindex5            0.7702721  0.1111336   6.931 7.54e-12 ***
## maritalstatus2         -0.1689206  0.1072387  -1.575 0.115535    
## maritalstatus3         -0.3904472  0.1271843  -3.070 0.002200 ** 
## birthorder              0.0005146  0.0121542   0.042 0.966236    
## breastfeeding2          0.6008211  0.0754787   7.960 4.71e-15 ***
## mothersageatfirstbirth  0.0288872  0.0075836   3.809 0.000148 ***
## Residence2             -0.1453165  0.0967993  -1.501 0.133621    
## Residence3             -0.1671277  0.0928746  -1.799 0.072246 .  
## Education2              0.0782821  0.1144187   0.684 0.494027    
## Education3             -0.4491009  0.1879573  -2.389 0.017064 *  
## Education4              0.3690813  0.4270735   0.864 0.387683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasi family taken to be 0.8339886)
## 
##     Null deviance: 255.51  on 1004  degrees of freedom
## Residual deviance: 264.35  on  985  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 19
# Fit ZIP model
zip_model <-zeroinfl(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|1, data = data, dist = "poisson")

# Summarize results
summary(zip_model)
## 
## Call:
## zeroinfl(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + birthorder + 
##     breastfeeding + mothersageatfirstbirth + Residence + Education | 
##     1, data = data, dist = "poisson")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37754 -0.70104  0.02747  0.52146  1.91064 
## 
## Count model coefficients (poisson with log link):
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.338e+00  2.365e-01  -9.886  < 2e-16 ***
## childsex2              -9.820e-02  6.186e-02  -1.587 0.112426    
## childtwin2              7.261e-03  1.915e-01   0.038 0.969758    
## childreneverborn        1.030e-01  1.316e-02   7.823 5.17e-15 ***
## watersource2            2.566e-01  7.694e-02   3.335 0.000852 ***
## toiletfacility2        -3.069e-01  1.008e-01  -3.045 0.002325 ** 
## wealthindex2            6.406e-01  1.471e-01   4.354 1.34e-05 ***
## wealthindex3            4.013e-01  1.450e-01   2.767 0.005652 ** 
## wealthindex4            5.635e-01  1.377e-01   4.091 4.30e-05 ***
## wealthindex5            5.011e-01  1.405e-01   3.567 0.000361 ***
## maritalstatus2         -1.377e-01  1.133e-01  -1.215 0.224487    
## maritalstatus3         -3.957e-01  1.486e-01  -2.662 0.007769 ** 
## birthorder             -1.015e-05  1.175e-02  -0.001 0.999311    
## breastfeeding2          6.093e-01  9.929e-02   6.137 8.42e-10 ***
## mothersageatfirstbirth  3.418e-02  7.751e-03   4.409 1.04e-05 ***
## Residence2             -1.018e-01  1.041e-01  -0.978 0.328076    
## Residence3             -1.166e-02  9.838e-02  -0.119 0.905636    
## Education2              1.771e-02  1.139e-01   0.156 0.876423    
## Education3             -2.736e-01  2.346e-01  -1.166 0.243486    
## Education4              3.980e-01  4.650e-01   0.856 0.392111    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -14.85     108.51  -0.137    0.891
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 45 
## Log-likelihood: -1189 on 21 Df
# Fit ZTP model
ztp_model <-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = subset(data, numberofdeath > 0), family = poisson())
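# Note: this drops the zeros and fits an ordinary Poisson likelihood to the positive
# counts; a fit based on the zero-truncated likelihood itself could be obtained with,
# for example, VGAM::vglm(..., family = pospoisson()).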

# Summarize results
summary(ztp_model)
## 
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + 
##     birthorder + breastfeeding + mothersageatfirstbirth + Residence + 
##     Education, family = poisson(), data = subset(data, numberofdeath > 
##     0))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.95934  -0.35266  -0.09722   0.25453   1.24787  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.8124986  0.2345634  -3.464 0.000532 ***
## childsex2               0.0079055  0.0626173   0.126 0.899533    
## childtwin2             -0.0762963  0.1921469  -0.397 0.691314    
## childreneverborn        0.0648403  0.0129062   5.024 5.06e-07 ***
## watersource2            0.0457141  0.0801554   0.570 0.568462    
## toiletfacility2        -0.3505887  0.1004278  -3.491 0.000481 ***
## wealthindex2            0.4110245  0.1473896   2.789 0.005292 ** 
## wealthindex3            0.1046121  0.1453998   0.719 0.471846    
## wealthindex4            0.1605622  0.1363616   1.177 0.239007    
## wealthindex5            0.0516197  0.1422557   0.363 0.716705    
## maritalstatus2          0.0703858  0.1164930   0.604 0.545707    
## maritalstatus3          0.0135826  0.1521566   0.089 0.928870    
## birthorder             -0.0005783  0.0118074  -0.049 0.960939    
## breastfeeding2          0.1732129  0.1015624   1.705 0.088105 .  
## mothersageatfirstbirth  0.0290687  0.0076072   3.821 0.000133 ***
## Residence2             -0.2246540  0.1104497  -2.034 0.041952 *  
## Residence3              0.1017992  0.1011931   1.006 0.314421    
## Education2              0.1318192  0.1134805   1.162 0.245397    
## Education3              0.1691714  0.2397268   0.706 0.480385    
## Education4              0.0123514  0.4659500   0.027 0.978852    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 251.07  on 693  degrees of freedom
## Residual deviance: 136.12  on 674  degrees of freedom
## AIC: 1775.4
## 
## Number of Fisher Scoring iterations: 4
# Fit hurdle poisson model
hurdlep_model<-hurdle(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "poisson")
# Summarize results
summary(hurdlep_model)
## 
## Call:
## hurdle(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + birthorder + 
##     breastfeeding + mothersageatfirstbirth + Residence + Education | 
##     Residence, data = data, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4669 -1.1000  0.1561  0.4921  2.8689 
## 
## Count model coefficients (truncated poisson with log link):
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.358e+00  4.528e-01  -7.416 1.21e-13 ***
## childsex2               5.437e-02  9.745e-02   0.558  0.57690    
## childtwin2             -1.839e-01  2.788e-01  -0.660  0.50945    
## childreneverborn        1.583e-01  2.176e-02   7.275 3.47e-13 ***
## watersource2            4.877e-02  1.537e-01   0.317  0.75096    
## toiletfacility2        -9.117e-01  1.917e-01  -4.755 1.99e-06 ***
## wealthindex2            8.911e-01  2.751e-01   3.240  0.00120 ** 
## wealthindex3            1.108e-01  2.720e-01   0.407  0.68373    
## wealthindex4            5.008e-01  2.623e-01   1.909  0.05624 .  
## wealthindex5            6.011e-02  2.768e-01   0.217  0.82806    
## maritalstatus2          2.917e-01  1.935e-01   1.507  0.13172    
## maritalstatus3          1.285e-01  2.534e-01   0.507  0.61206    
## birthorder             -3.151e-03  1.649e-02  -0.191  0.84847    
## breastfeeding2          7.773e-01  2.373e-01   3.276  0.00105 ** 
## mothersageatfirstbirth  5.642e-02  1.274e-02   4.429 9.46e-06 ***
## Residence2             -5.274e-01  1.973e-01  -2.673  0.00753 ** 
## Residence3              4.022e-01  1.984e-01   2.027  0.04264 *  
## Education2              1.120e-01  1.851e-01   0.605  0.54503    
## Education3              5.508e-01  3.641e-01   1.513  0.13040    
## Education4             -1.382e+01  1.182e+03  -0.012  0.99067    
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80922    0.14911   5.427 5.73e-08 ***
## Residence2   0.07808    0.20142   0.388    0.698    
## Residence3  -0.05024    0.17584  -0.286    0.775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 43 
## Log-likelihood: -1185 on 23 Df
# Fit ZINB model
zinb_model <-zeroinfl(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "negbin")
## Warning in value[[3L]](cond): system is computationally singular: reciprocal
## condition number = 5.40204e-37FALSE
# Summarize results
summary(zinb_model)
## 
## Call:
## zeroinfl(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + birthorder + 
##     breastfeeding + mothersageatfirstbirth + Residence + Education | 
##     Residence, data = data, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37751 -0.70108  0.02744  0.52144  1.91058 
## 
## Count model coefficients (negbin with log link):
##                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)            -2.338e+00         NA      NA       NA
## childsex2              -9.819e-02         NA      NA       NA
## childtwin2              7.334e-03         NA      NA       NA
## childreneverborn        1.030e-01         NA      NA       NA
## watersource2            2.566e-01         NA      NA       NA
## toiletfacility2        -3.069e-01         NA      NA       NA
## wealthindex2            6.405e-01         NA      NA       NA
## wealthindex3            4.012e-01         NA      NA       NA
## wealthindex4            5.634e-01         NA      NA       NA
## wealthindex5            5.010e-01         NA      NA       NA
## maritalstatus2         -1.377e-01         NA      NA       NA
## maritalstatus3         -3.958e-01         NA      NA       NA
## birthorder             -9.030e-06         NA      NA       NA
## breastfeeding2          6.093e-01         NA      NA       NA
## mothersageatfirstbirth  3.417e-02         NA      NA       NA
## Residence2             -1.018e-01         NA      NA       NA
## Residence3             -1.163e-02         NA      NA       NA
## Education2              1.766e-02         NA      NA       NA
## Education3             -2.737e-01         NA      NA       NA
## Education4              3.973e-01         NA      NA       NA
## Log(theta)              1.483e+01         NA      NA       NA
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -55.515         NA      NA       NA
## Residence2   -21.636         NA      NA       NA
## Residence3    -5.911         NA      NA       NA
## 
## Theta = 2745497.5196 
## Number of iterations in BFGS optimization: 50 
## Log-likelihood: -1189 on 24 Df
# Fit ZTNB model
ztnb_model <-glm.nb(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = subset(data, numberofdeath > 0))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Summarize results
summary(ztnb_model)
## 
## Call:
## glm.nb(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + 
##     birthorder + breastfeeding + mothersageatfirstbirth + Residence + 
##     Education, data = subset(data, numberofdeath > 0), init.theta = 130858.5048, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.95933  -0.35266  -0.09722   0.25453   1.24785  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.8124955  0.2345652  -3.464 0.000533 ***
## childsex2               0.0079056  0.0626178   0.126 0.899533    
## childtwin2             -0.0762969  0.1921485  -0.397 0.691314    
## childreneverborn        0.0648402  0.0129063   5.024 5.06e-07 ***
## watersource2            0.0457144  0.0801560   0.570 0.568463    
## toiletfacility2        -0.3505888  0.1004286  -3.491 0.000481 ***
## wealthindex2            0.4110241  0.1473907   2.789 0.005292 ** 
## wealthindex3            0.1046129  0.1454008   0.719 0.471845    
## wealthindex4            0.1605624  0.1363625   1.177 0.239009    
## wealthindex5            0.0516200  0.1422566   0.363 0.716705    
## maritalstatus2          0.0703858  0.1164938   0.604 0.545709    
## maritalstatus3          0.0135823  0.1521576   0.089 0.928871    
## birthorder             -0.0005783  0.0118075  -0.049 0.960939    
## breastfeeding2          0.1732129  0.1015631   1.705 0.088107 .  
## mothersageatfirstbirth  0.0290686  0.0076073   3.821 0.000133 ***
## Residence2             -0.2246540  0.1104506  -2.034 0.041954 *  
## Residence3              0.1017986  0.1011939   1.006 0.314427    
## Education2              0.1318193  0.1134814   1.162 0.245401    
## Education3              0.1691715  0.2397285   0.706 0.480387    
## Education4              0.0123513  0.4659520   0.027 0.978852    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(130858.5) family taken to be 1)
## 
##     Null deviance: 251.07  on 693  degrees of freedom
## Residual deviance: 136.12  on 674  degrees of freedom
## AIC: 1777.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  130859 
##           Std. Err.:  887717 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -1735.423
# Fit hurdle negative binomial model
hurdlenb_model<-hurdle(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "negbin")
## Warning in sqrt(diag(vc_count)[kx + 1]): NaNs produced
# Summarize results
summary(hurdlenb_model)
## 
## Call:
## hurdle(formula = numberofdeath ~ childsex + childtwin + childreneverborn + 
##     watersource + toiletfacility + wealthindex + maritalstatus + birthorder + 
##     breastfeeding + mothersageatfirstbirth + Residence + Education | 
##     Residence, data = data, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4669 -1.1000  0.1561  0.4922  2.8687 
## 
## Count model coefficients (truncated negbin with log link):
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.35809    0.45282  -7.416 1.21e-13 ***
## childsex2                0.05446    0.09745   0.559  0.57629    
## childtwin2              -0.18384    0.27875  -0.660  0.50957    
## childreneverborn         0.15828    0.02176   7.274 3.48e-13 ***
## watersource2             0.04881    0.15368   0.318  0.75078    
## toiletfacility2         -0.91178    0.19173  -4.755 1.98e-06 ***
## wealthindex2             0.89107    0.27505   3.240  0.00120 ** 
## wealthindex3             0.11082    0.27204   0.407  0.68373    
## wealthindex4             0.50077    0.26230   1.909  0.05624 .  
## wealthindex5             0.06013    0.27677   0.217  0.82801    
## maritalstatus2           0.29167    0.19351   1.507  0.13175    
## maritalstatus3           0.12853    0.25340   0.507  0.61198    
## birthorder              -0.00315    0.01649  -0.191  0.84854    
## breastfeeding2           0.77731    0.23729   3.276  0.00105 ** 
## mothersageatfirstbirth   0.05641    0.01274   4.429 9.46e-06 ***
## Residence2              -0.52740    0.19732  -2.673  0.00752 ** 
## Residence3               0.40220    0.19841   2.027  0.04265 *  
## Education2               0.11206    0.18509   0.605  0.54487    
## Education3               0.55082    0.36414   1.513  0.13037    
## Education4             -12.73724    1.26628 -10.059  < 2e-16 ***
## Log(theta)              14.07542        NaN     NaN      NaN    
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80922    0.14911   5.427 5.73e-08 ***
## Residence2   0.07808    0.20142   0.388    0.698    
## Residence3  -0.05024    0.17584  -0.286    0.775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 1296814.9889
## Number of iterations in BFGS optimization: 44 
## Log-likelihood: -1185 on 24 Df

In summary, Poisson, Negative Binomial, Geometric, Quasi-Poisson, ZIP, ZINB, ZTP, ZTNB, hurdle Poisson, and hurdle negative binomial regression models are all useful for analyzing count data, but they differ in their assumptions and approach. The choice of model depends on the nature of the data and the specific research question being addressed.

Model Comparison

# AIC
aic<-AIC(poisson_model,nb_model,zip_model,zinb_model,hurdlep_model,hurdlenb_model)
bic<-BIC(poisson_model,nb_model,zip_model,zinb_model,hurdlep_model,hurdlenb_model)
aic
##                df      AIC
## poisson_model  20 2417.383
## nb_model       21 2419.401
## zip_model      21 2419.383
## zinb_model     24 2425.383
## hurdlep_model  23 2415.942
## hurdlenb_model 24 2417.942
bic
##                df      BIC
## poisson_model  20 2515.637
## nb_model       21 2522.569
## zip_model      21 2522.550
## zinb_model     24 2543.289
## hurdlep_model  23 2528.935
## hurdlenb_model 24 2535.848
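The information criteria above can be complemented by a Vuong closeness test for non-nested models; a minimal sketch comparing the Poisson and zero-inflated Poisson fits estimated earlier is shown below (vuong() is in the pscl package already loaded above).

# Vuong test comparing the non-nested Poisson and ZIP fits
vuong(poisson_model, zip_model)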

Interpretation

The results presented show the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values for the different count regression models fitted to the data. Both criteria measure the goodness of fit of a statistical model while penalizing model complexity; lower AIC and BIC values indicate a better fit of the model to the data.

In this study, six count regression models were fitted to the data: Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), hurdle Poisson, and hurdle negative binomial.

The results show that the hurdle Poisson regression model had the lowest AIC value (2415.94), while the ordinary Poisson model had the lowest BIC value (2515.64), because BIC penalizes the extra parameters of the two-part models more heavily. Since the hurdle Poisson model explicitly accounts for the excess zeros while assuming equidispersion in the positive counts, it was selected as the best-fitting model for interpretation.

Ranked by AIC, the Poisson and hurdle negative binomial models followed closely (2417.38 and 2417.94), then the ZIP and NB models (2419.38 and 2419.40), with the ZINB model showing the poorest fit (2425.38).

Overall, the results suggest that the hurdle Poisson model offers the best balance of fit and parsimony and is therefore used below to identify the risk factors associated with under-five mortality in Togdheer.

Interpretation based on the best model

The results presented are from the hurdle Poisson regression model, a two-part model that accounts for the excess zeros and models the positive counts with a zero-truncated Poisson distribution. The model was used to investigate the risk factors associated with under-five mortality in Togdheer, using a set of independent variables.

The results are presented in two parts. The first part shows the count model coefficients, which represent the association between the independent variables and the count of deaths among children under the age of five years. Because the count part uses a log link, the intercept represents the expected log count of deaths when all predictors are at zero or at their reference categories.

The results show that the number of children ever born, toilet facility, wealth index, breastfeeding, mother’s age at first birth, and residence were significantly associated with the count of deaths among children under the age of five years. Judging by the signs of the coefficients, the toiletfacility2 and Residence2 categories were associated with lower expected counts, while a larger number of children ever born, the wealthindex2 category, the breastfeeding2 category, an older age at first birth, and the Residence3 category were associated with higher expected counts.

Other variables, such as childsex, childtwin, watersource, maritalstatus, birthorder, and Education, were not significantly associated with the count of under-five deaths in this study.

The second part of the results shows the zero hurdle model coefficients, which model the probability of observing at least one death (crossing the hurdle) as a function of residence. Only the intercept was statistically significant; the residence categories were not, suggesting that the probability of experiencing any under-five death did not differ markedly by residence.

The model’s log-likelihood was -1185 on 23 degrees of freedom. The Pearson residuals range from about -1.5 to 2.9 with no extreme values, suggesting an adequate fit of the model to the data.

Overall, the results suggest that factors related to maternal and child health, such as the number of children ever born, toilet facility, wealth index, breastfeeding, mother’s age at first birth, and residence, are important predictors of under-five mortality in Togdheer. Policymakers and stakeholders should focus on implementing interventions to improve maternal and child health outcomes to reduce under-five mortality in the region.

For more information about drafting a research paper using count regression models, we refer readers to Argawu et al. (2022).

Module III: Regression Models for Categorical Variable Responses

Regression Models for Ordinal Variable Responses

Regression models with an ordinal response variable are commonly used in situations where the outcome of interest is an ordered categorical variable with three or more levels. In this type of model, the outcome variable is modeled as a function of one or more predictor variables. Here are five regression models commonly used with an ordinal response variable, along with their definition, application, assumptions, real-life examples, and R code:

Ordered Logistic Regression Model

Definition: Ordinal logistic regression is a statistical model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the odds of being in a higher category of the outcome variable are modeled as a function of the predictor variables.

Application: Ordinal logistic regression can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.

Assumptions: The assumptions of ordinal logistic regression include the proportional odds assumption, which states that the effect of each predictor variable (its odds ratio) is the same across all cumulative splits of the outcome categories.

Real-life example: A researcher wants to investigate the relationship between the level of education and the severity of depression in a sample of patients. The outcome variable is the severity of depression, which ranges from mild to severe, and the predictor variable is the level of education, which is categorized as high school, college, or graduate school.

Ordered Probit Regression Model

Definition: Ordered probit regression is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the probability of being in each category of the outcome variable is modeled as a function of the predictor variables using a probit link function.

Application: Ordered probit regression can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.

Assumptions: The assumptions of ordered probit regression include the normality of the error term and the independence of the error terms across categories of the outcome variable.

Real-life example: A researcher wants to investigate the relationship between the level of physical activity and the level of depression in a sample of adults. The outcome variable is the level of depression, which is categorized as mild, moderate, or severe, and the predictor variable is the level of physical activity, which is categorized as low, moderate, or high.

Cumulative logit model:

Definition: The cumulative logit model is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the cumulative probabilities of being in each category of the outcome variable are modeled as a function of the predictor variables.

Application: Cumulative logit models can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.

Assumptions: The assumptions of the cumulative logit model include the proportional odds assumption, which states that the effect of each predictor variable (its odds ratio) is the same across all cumulative splits of the outcome categories.

Real-life example: A researcher wants to investigate the relationship between the level of education and the level of income in a sample of workers. The outcome variable is the level of income, which is categorized as low, medium, or high, and the predictor variable is the level of education, which is categorized as high school, college, or graduate school.

Proportional Odds Model

Definition: Proportional odds model is a type of ordinal logistic regression model where the probability of a response variable taking a particular value or lower is modeled as a function of the predictor variables. It is used when the response variable is ordered or ordinal, and the categories have a natural ordering.

Application: The proportional odds model can be applied to a wide range of research areas, including social sciences, medical research, and marketing. It can be used to investigate the relationship between predictor variables and a range of ordinal outcomes, such as Likert scale responses, disease stages, or educational levels.

Assumptions: The proportional odds model assumes that the odds ratios for each predictor variable are constant across all levels of the response variable. It also assumes that the relationship between the predictor variables and the response variable is linear on the log-odds scale.

Real-life example: Suppose a researcher wants to investigate the relationship between age, gender, and income on the level of education attained. The response variable is ordered, with categories of “less than high school,” “high school,” “some college,” and “college degree or higher.” The proportional odds model can be used to model the probability of attaining a certain level of education based on age, gender, and income.

Partial Proportional Odds Model

Definition: The partial proportional odds model is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the proportional odds assumption is relaxed for one or more of the predictor variables.

Application: Partial proportional odds models can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable when the proportional odds assumption is violated for some of the predictors.

Assumptions: The partial proportional odds model assumes that the proportional odds assumption holds for some of the predictor variables, while it is relaxed for the remaining ones.

Real-life example: A researcher wants to investigate the relationship between the level of social support and the level of depression in a sample of adults. The outcome variable is the level of depression, which is categorized as mild, moderate, or severe, and the predictor variable is the level of social support, which is categorized as low, moderate, or high. However, the proportional odds assumption does not hold for the level of social support variable.

R Code for All

# Load the package and data
library(MASS)
library(ordinal)
## Warning: package 'ordinal' was built under R version 4.1.3
## 
## Attaching package: 'ordinal'
## The following objects are masked from 'package:actuar':
## 
##     dgumbel, dlgamma, pgumbel, plgamma, qgumbel, rgumbel
data(Cars93)

# Create an ordered response variable
Cars93$Price <- cut(Cars93$Price, breaks = c(0, 15, 30, 45, 60, 75, 90, Inf), labels = c("0-15", "15-30", "30-45", "45-60", "60-75", "75-90", ">90"))
Cars93$Price <- ordered(Cars93$Price)

# Proportional odds model
model_proportional_odds <- polr(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
summary(model_proportional_odds)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
## 
## Coefficients:
##               Value Std. Error t value
## MPG.city    -0.7621     0.2465 -3.0910
## MPG.highway  0.2014     0.1670  1.2061
## EngineSize   0.2833     0.3743  0.7568
## 
## Intercepts:
##             Value    Std. Error t value 
## 0-15|15-30  -11.1564   3.4265    -3.2559
## 15-30|30-45  -6.5208   3.1823    -2.0491
## 30-45|45-60  -4.2840   3.2662    -1.3116
## 45-60|60-75  -3.5746   3.3428    -1.0694
## 
## Residual Deviance: 119.4952 
## AIC: 133.4952
# Partial proportional odds model
model_partial_proportional_odds <- polr(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, method = "logistic")
summary(model_partial_proportional_odds)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, 
##     method = "logistic")
## 
## Coefficients:
##               Value Std. Error t value
## MPG.city    -0.7621     0.2465 -3.0910
## MPG.highway  0.2014     0.1670  1.2061
## EngineSize   0.2833     0.3743  0.7568
## 
## Intercepts:
##             Value    Std. Error t value 
## 0-15|15-30  -11.1564   3.4265    -3.2559
## 15-30|30-45  -6.5208   3.1823    -2.0491
## 30-45|45-60  -4.2840   3.2662    -1.3116
## 45-60|60-75  -3.5746   3.3428    -1.0694
## 
## Residual Deviance: 119.4952 
## AIC: 133.4952
# Ordinal logistic regression model
model_ordinal_logistic <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "logit")
summary(model_ordinal_logistic)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data:    Cars93
## 
##  link  threshold nobs logLik AIC    niter max.grad cond.H 
##  logit flexible  93   -59.75 133.50 7(0)  3.04e-10 8.4e+05
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## MPG.city     -0.7621     0.2465  -3.091  0.00199 **
## MPG.highway   0.2014     0.1670   1.206  0.22777   
## EngineSize    0.2833     0.3743   0.757  0.44919   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##             Estimate Std. Error z value
## 0-15|15-30   -11.157      3.427  -3.256
## 15-30|30-45   -6.521      3.182  -2.049
## 30-45|45-60   -4.284      3.266  -1.312
## 45-60|60-75   -3.575      3.343  -1.069
# Ordinal probit regression model
model_ordinal_probit <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "probit")
summary(model_ordinal_probit)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data:    Cars93
## 
##  link   threshold nobs logLik AIC    niter max.grad cond.H 
##  probit flexible  93   -59.93 133.85 7(1)  5.82e-09 7.8e+05
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## MPG.city    -0.41922    0.13153  -3.187  0.00144 **
## MPG.highway  0.09881    0.09237   1.070  0.28476   
## EngineSize   0.14504    0.19981   0.726  0.46791   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##             Estimate Std. Error z value
## 0-15|15-30    -6.529      1.859  -3.512
## 15-30|30-45   -3.900      1.739  -2.243
## 30-45|45-60   -2.769      1.764  -1.570
## 45-60|60-75   -2.494      1.779  -1.402
# Complementary log-log model
model_cloglog<- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "cloglog")
summary(model_cloglog)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data:    Cars93
## 
##  link    threshold nobs logLik AIC    niter max.grad cond.H 
##  cloglog flexible  93   -62.56 139.12 8(2)  4.83e-13 9.1e+05
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## MPG.city    -0.41180    0.12828  -3.210  0.00133 **
## MPG.highway  0.06007    0.09191   0.654  0.51339   
## EngineSize   0.09129    0.18935   0.482  0.62972   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##             Estimate Std. Error z value
## 0-15|15-30    -8.191      2.064  -3.969
## 15-30|30-45   -5.341      1.852  -2.884
## 30-45|45-60   -4.449      1.849  -2.406
## 45-60|60-75   -4.293      1.852  -2.318
aic<-AIC(model_ordinal_logistic,model_ordinal_probit,model_proportional_odds,model_partial_proportional_odds)
aic
##                                 df      AIC
## model_ordinal_logistic           7 133.4952
## model_ordinal_probit             7 133.8548
## model_proportional_odds          7 133.4952
## model_partial_proportional_odds  7 133.4952
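One caveat on the code above: polr() with method = "logistic" simply refits the same proportional odds model, which is why the proportional and "partial" proportional odds fits report identical AIC values. A genuine partial proportional odds model relaxes the parallel-lines constraint for selected predictors. A minimal sketch using the VGAM package is given below; it is an illustration under the assumption that VGAM is available, not part of the original analysis, and the fully non-parallel fit may be unstable because the upper price categories are sparse.

# Proportional odds versus a non-parallel cumulative logit fit (VGAM sketch)
library(VGAM)
cars <- transform(Cars93, Price = droplevels(Price))   # drop empty price bands
po_fit  <- vglm(Price ~ MPG.city + MPG.highway + EngineSize,
                family = cumulative(parallel = TRUE), data = cars)
npo_fit <- vglm(Price ~ MPG.city + MPG.highway + EngineSize,
                family = cumulative(parallel = FALSE), data = cars)
# Supplying a formula to parallel = (restricting it to selected terms) gives a
# partial proportional odds model; compare the fits with a likelihood-ratio test
lrtest(po_fit, npo_fit)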

Regression Models for Binary Variable Responses

  1. Logistic regression: This is a type of regression model used when the dependent variable has two categories (e.g., yes or no, 1 or 0). The goal of logistic regression is to predict the probability of the positive category based on one or more predictor variables.

  2. Probit regression: This is a type of regression model that is similar to logistic regression but uses the cumulative distribution function of the standard normal distribution as the link between the predictors and the probability of the positive category.

  3. Complementary log-log regression: This is a type of regression model that models log(-log(1 - p)), the complementary log-log of the probability of the positive category, as a linear function of the predictor variables.

  4. Cauchit regression: This is a type of regression model used when the dependent variable is binary or dichotomous, meaning that it can take on only two values, such as 0 or 1. It is a generalized linear model (GLM) whose link function is the quantile function of the standard Cauchy distribution, giving it heavier tails than the logit or probit links.

  5. Generalized estimating equations (GEE): This is a type of regression model used when there are repeated measurements of the same binary outcome variable. GEE accounts for the correlation between the repeated measurements and yields consistent coefficient estimates with robust standard errors, even if the working correlation structure is misspecified (see the sketch below).
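Since GEE is not demonstrated in the R code below, a minimal sketch using the geepack package is given here; the data are simulated and the variable names (id, visit, x, y) are hypothetical.

# GEE logistic regression sketch with an exchangeable working correlation
library(geepack)
set.seed(4)
gee_data <- data.frame(id    = rep(1:100, each = 3),   # 3 repeated measures per subject
                       visit = rep(1:3, times = 100),
                       x     = rnorm(300))
gee_data$y <- rbinom(300, 1, plogis(-0.5 + 0.8 * gee_data$x))
gee_fit <- geeglm(y ~ x + visit, id = id, data = gee_data,
                  family = binomial, corstr = "exchangeable")
summary(gee_fit)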

R Code

Here we analyze low birth weight data to illustrate the use of logistic regression. The dataset is presented in Hosmer and Lemeshow (2004) and contains information on 189 births to women seen in an obstetric clinic; the data were collected as part of a larger study at Baystate Medical Center in Springfield, Massachusetts. The response variable LOW is a binary outcome indicating a birth weight of less than 2500 grams, which has long been of concern to physicians.

# Load the dataset
library(brinla)
## Loading required package: INLA
## Loading required package: foreach
## Loading required package: parallel
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.1.3
## This is INLA_22.03.03 built 2022-03-03 07:31:05 UTC.
##  - See www.r-inla.org/contact-us for how to get help.
data("lowbwt")
# Fit a logistic regression model
logitmodel <- glm(LOW ~ AGE+LWT+RACE+SMOKE+HT+UI+FTV, data =lowbwt, family = binomial(link = "logit"))

# View the model summary
summary(logitmodel)
## 
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV, 
##     family = binomial(link = "logit"), data = lowbwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7278  -0.8291  -0.5482   0.9709   2.1962  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.45481    1.18541   0.384  0.70122   
## AGE         -0.02051    0.03595  -0.570  0.56844   
## LWT         -0.01652    0.00686  -2.409  0.01600 * 
## RACE2        1.28976    0.52761   2.445  0.01451 * 
## RACE3        0.91907    0.43630   2.106  0.03516 * 
## SMOKE1       1.04159    0.39548   2.634  0.00844 **
## HT1          1.88506    0.69481   2.713  0.00667 **
## UI1          0.90415    0.44860   2.015  0.04385 * 
## FTV          0.05912    0.17200   0.344  0.73105   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 203.83  on 180  degrees of freedom
## AIC: 221.83
## 
## Number of Fisher Scoring iterations: 4
# Fit a Probit regression model
probitmodel <- glm(LOW ~ AGE+LWT+RACE+SMOKE+HT+UI+FTV, data =lowbwt, family = binomial(link = "probit"))
# View the model summary
summary(probitmodel)
## 
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV, 
##     family = binomial(link = "probit"), data = lowbwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7043  -0.8491  -0.5351   0.9740   2.2252  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.256356   0.696796   0.368  0.71294   
## AGE         -0.012640   0.021179  -0.597  0.55063   
## LWT         -0.009645   0.003966  -2.432  0.01502 * 
## RACE2        0.762234   0.314450   2.424  0.01535 * 
## RACE3        0.540807   0.253656   2.132  0.03300 * 
## SMOKE1       0.634580   0.230601   2.752  0.00593 **
## HT1          1.122746   0.415545   2.702  0.00690 **
## UI1          0.546000   0.272855   2.001  0.04539 * 
## FTV          0.022830   0.101442   0.225  0.82193   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 203.55  on 180  degrees of freedom
## AIC: 221.55
## 
## Number of Fisher Scoring iterations: 5
# Fit a complementary log-log regression model
cloglogmodel <- glm(LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV, data = lowbwt, family = binomial(link = "cloglog"))

# View the model summary
summary(cloglogmodel)
## 
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV, 
##     family = binomial(link = "cloglog"), data = lowbwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8386  -0.8125  -0.5644   0.9948   2.1519  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.116450   0.914946  -0.127  0.89872   
## AGE         -0.017513   0.028348  -0.618  0.53671   
## LWT         -0.012515   0.005378  -2.327  0.01997 * 
## RACE2        1.102487   0.394463   2.795  0.00519 **
## RACE3        0.747427   0.338748   2.206  0.02735 * 
## SMOKE1       0.817161   0.301502   2.710  0.00672 **
## HT1          1.468116   0.456304   3.217  0.00129 **
## UI1          0.687640   0.331881   2.072  0.03827 * 
## FTV          0.076347   0.135185   0.565  0.57224   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 204.14  on 180  degrees of freedom
## AIC: 222.14
## 
## Number of Fisher Scoring iterations: 6
# Fit a Cauchit regression model
caumodel <- glm(LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV, data = lowbwt, family = binomial(link = "cauchit"))

# View the model summary
summary(caumodel)
## 
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV, 
##     family = binomial(link = "cauchit"), data = lowbwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8054  -0.7911  -0.5859   0.9584   2.0566  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.643511   1.266545   0.508  0.61139   
## AGE         -0.024189   0.038579  -0.627  0.53066   
## LWT         -0.018296   0.008279  -2.210  0.02712 * 
## RACE2        1.376953   0.573070   2.403  0.01627 * 
## RACE3        1.031063   0.498556   2.068  0.03863 * 
## SMOKE1       0.951250   0.443962   2.143  0.03214 * 
## HT1          2.009150   0.770681   2.607  0.00913 **
## UI1          0.904793   0.434412   2.083  0.03727 * 
## FTV          0.154221   0.183646   0.840  0.40103   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 205.04  on 180  degrees of freedom
## AIC: 223.04
## 
## Number of Fisher Scoring iterations: 8
# Model Comparison
aic<-AIC(logitmodel,probitmodel,cloglogmodel,caumodel)
aic
##              df      AIC
## logitmodel    9 221.8310
## probitmodel   9 221.5472
## cloglogmodel  9 222.1358
## caumodel      9 223.0378
bic<-BIC(logitmodel,probitmodel,cloglogmodel,caumodel)
bic
##              df      BIC
## logitmodel    9 251.0067
## probitmodel   9 250.7229
## cloglogmodel  9 251.3115
## caumodel      9 252.2136
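Since the primer's focus is Bayesian estimation, the same logistic model can also be fitted with INLA, which is already loaded above via the brinla package. The following is a minimal sketch using INLA's default (vague) priors on the fixed effects; DIC and WAIC are requested so the Bayesian fit can be set alongside the AIC/BIC comparison above.

# Sketch only: Bayesian logistic regression with INLA and default priors
bayeslogit <- inla(LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV,
                   data = lowbwt, family = "binomial", Ntrials = 1,
                   control.compute = list(dic = TRUE, waic = TRUE))
# Posterior means, SDs and 95% credible intervals of the coefficients
round(bayeslogit$summary.fixed, 3)
# DIC and WAIC, comparable in spirit to the AIC/BIC table above
c(DIC = bayeslogit$dic$dic, WAIC = bayeslogit$waic$waic)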

Regression Models for Multinomial Variable Responses

  1. Multinomial logistic regression: This is a type of regression model used when the dependent variable has three or more unordered categories. The goal of multinomial logistic regression is to predict the probability of each possible outcome category based on one or more predictor variables.

  2. Conditional logistic regression: This is a variant of the logit model in which estimation conditions on groups of related observations (for example, matched sets in a case-control study, or the set of alternatives available to each individual), so that dependence within each group does not bias the coefficient estimates.

  3. Multinomial probit regression: This is a type of regression model similar to multinomial logistic regression, but it assumes that the latent utilities underlying the outcome categories have normally distributed, possibly correlated, errors.

  4. Generalized multinomial logit model: This is a type of regression model that extends the multinomial logistic regression model to allow for more flexible relationships between the predictor variables and the outcome variable, for example by letting coefficients vary across individuals.

  5. Discriminant analysis: This is a classification method, closely related to regression, used to predict the category of a dependent variable based on one or more predictor variables. Discriminant analysis assumes that the predictor variables follow a (multivariate) normal distribution within each category.
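As a minimal illustration of model 1 in this list, the sketch below fits a multinomial logistic regression with multinom() from the nnet package (shipped with R), using the built-in iris data, whose Species outcome has three unordered categories. This is an illustrative sketch rather than part of the low birth weight analysis above.

# Sketch only: multinomial logistic regression on the built-in iris data
library(nnet)

multimodel <- multinom(Species ~ Sepal.Length + Sepal.Width +
                         Petal.Length + Petal.Width, data = iris)
summary(multimodel)        # coefficients relative to the baseline category
head(fitted(multimodel))   # predicted category probabilities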