Regression models are widely used to determine relationships between variables, and they come in a variety of forms depending on the nature and scale of measurement of the dependent variable. Researchers often find it challenging to identify the appropriate type of regression for their data. This primer demonstrates the use of 40 distinct regression models, along with the computational challenges and underlying assumptions of each. Each model's application is discussed together with real-world example data; the model is then fitted using R, Python, and Stata software, and the results are interpreted. The models were selected to improve the estimation of relationships between different types of variables and are introduced according to the type and scale of measurement of the dependent variable. Regression models are increasingly in use today, and newer statistical techniques such as data mining rely on regression as a core mechanism for prediction. By introducing commonly used regression models, providing software instructions, and interpreting the results, this tutorial aims to help researchers better understand the computational implementation of various regression models.
Regression analysis plays a crucial role in empirical research, allowing researchers to uncover relationships between variables and make predictions based on observed data. However, the choice of an appropriate regression model is not a one-size-fits-all approach. The scale of measurement of the dependent variable is a critical consideration in determining the suitable regression technique.
Understanding the nuances and distinctions between regression models for different scales of measurement is essential for researchers aiming to conduct accurate and meaningful analyses. By categorizing and exploring regression models based on the scale of the dependent variable, researchers can gain valuable insights into the specific techniques best suited for their research questions and data characteristics.
Motivated by the need to provide researchers with a comprehensive resource, this tutorial paper aims to bridge the gap by presenting 40 regression models specifically tailored to different scales of measurement. By utilizing popular software tools such as R, Python, and Stata, researchers can effectively implement these models in their analyses, leading to robust and insightful research outcomes.
By familiarizing researchers with a wide range of regression models and empowering them with practical implementation techniques, this tutorial paper equips researchers with the necessary knowledge and tools to conduct rigorous regression analysis across various fields of study.
The objectives of this tutorial paper are twofold. Firstly, it aims to provide researchers with a comprehensive understanding of regression models categorized by the scale of measurement of the dependent variable. By presenting 40 different regression models, researchers will gain insights into the specific techniques suitable for continuous, count or discrete, binary, multinomial, and ordinal variables. This categorization allows researchers to match their data characteristics with the appropriate regression model, ensuring accurate and meaningful analysis.
Secondly, this tutorial paper aims to empower researchers with practical implementation skills using popular software tools such as R, Python, and Stata. Through step-by-step guidance and code examples, researchers will be able to apply the regression models to their own data sets, fostering a deeper understanding of the models and their applications. By utilizing these software tools effectively, researchers can streamline their analysis processes and leverage the full potential of their data.
By achieving these objectives, this tutorial paper seeks to equip researchers with the necessary knowledge and skills to conduct robust regression analysis. It serves as a valuable resource for both novice and experienced researchers, enabling them to make informed decisions in selecting and implementing regression models based on the scale of measurement of their dependent variables. Ultimately, this tutorial paper aims to enhance the quality and rigor of research across various disciplines by providing researchers with the tools and knowledge needed to conduct sophisticated regression analyses.
Regression analysis is a powerful tool for researchers to examine the relationships between variables and make predictions. However, the scale of measurement of the dependent variable plays a crucial role in determining the appropriate regression model. Understanding the nuances and requirements of regression models for different scales of measurement is essential for conducting accurate and meaningful analyses.
In this tutorial paper, we will provide a comprehensive overview of regression models categorized by the scale of measurement, encompassing continuous, count or discrete, binary, multinomial, and ordinal variables.
Continuous variables are characterized by a numeric scale, such as age or income, and require regression models that can handle the continuous nature of the data. We will explore various models, including linear regression, polynomial regression, and spline regression, among others, and demonstrate their implementation using R, Python, and Stata.
Count or discrete variables involve whole numbers, such as the number of occurrences or events. To analyze such variables, we will discuss models like Poisson regression, negative binomial regression, and zero-inflated models. Practical examples and code snippets in R, Python, and Stata will be provided to facilitate understanding and implementation.
Binary variables have two possible outcomes, such as yes/no or success/failure. Logistic regression and probit regression are commonly used models for binary variables, and we will illustrate their application using R, Python, and Stata.
Multinomial variables encompass multiple unordered categories, such as different types of products or political affiliations. We will cover models like multinomial logistic regression and demonstrate their implementation with R, Python, and Stata.
Ordinal variables involve categories with a specific ordering, such as Likert scale ratings or educational levels. We will explore models like ordered logistic regression and provide practical examples and code snippets using R, Python, and Stata.
Throughout the tutorial paper, we will discuss the underlying theory, assumptions, and implementation steps for each regression model. Researchers will gain an understanding of the strengths and limitations of each model and learn how to apply them effectively using R, Python, and Stata. By providing this comprehensive overview and practical guidance, researchers will be equipped with the knowledge and tools to conduct robust regression analyses across a wide range of data scales.
To facilitate the implementation of regression models across different scales of measurement, this tutorial paper utilizes popular software tools including R, Python, and Stata. Each software offers unique features and capabilities that can enhance the analysis process and provide researchers with a wide range of options for regression modeling.
R is a widely used open-source programming language and software environment for statistical computing and graphics. It provides a vast collection of packages specifically designed for regression analysis, making it a powerful tool for researchers. R offers a flexible and extensive set of functions and algorithms for implementing various regression models. Researchers can leverage the capabilities of R to perform data preprocessing, model estimation, diagnostic checks, and visualization, all within a single environment.
Throughout this tutorial paper, we will showcase examples and provide code snippets using R, enabling researchers to gain practical experience in implementing regression models using this software.
Python is a versatile and popular programming language known for its simplicity and readability. It offers a rich ecosystem of libraries and packages for data analysis and modeling. Researchers can utilize libraries such as NumPy, pandas, and scikit-learn to preprocess data, estimate regression models, and evaluate model performance. Python’s flexibility and extensive community support make it a valuable tool for researchers seeking to implement regression models.
In this tutorial paper, we will provide code examples and demonstrate the implementation of regression models using Python, allowing researchers to explore the capabilities of this versatile programming language.
Stata is a comprehensive statistical software package widely used in academic, business, and government settings for data analysis and management. It provides a user-friendly interface for regression analysis and offers a wide range of statistical procedures, including various regression models. Stata’s intuitive command-based approach allows researchers to efficiently implement regression models and conduct robust statistical analyses.
Throughout the tutorial paper, we will include examples and code snippets using Stata, enabling researchers to leverage the capabilities of this software for regression modeling.
By utilizing R, Python, and Stata, researchers can choose the software tool that aligns with their preferences and requirements. These software tools offer a rich set of functions, libraries, and packages for implementing regression models, empowering researchers to conduct comprehensive and sophisticated analyses.
This tutorial paper aims to achieve several key outcomes for researchers seeking to enhance their regression modeling skills:
By categorizing regression models based on the scale of measurement of the dependent variable, this paper provides researchers with a comprehensive overview of various regression techniques. Researchers will gain a deep understanding of the specific models suitable for continuous, count or discrete, binary, multinomial, and ordinal variables. This knowledge will enable researchers to make informed decisions when selecting the most appropriate regression model for their data.
This tutorial paper goes beyond theoretical explanations by providing practical implementation guidance. Researchers will learn how to apply regression models using popular software tools such as R, Python, and Stata. Through step-by-step instructions, code examples, and explanations, researchers will acquire hands-on experience in implementing regression models. This practical knowledge will empower researchers to confidently apply regression techniques to their own research studies.
By gaining a comprehensive understanding of regression models and acquiring practical implementation skills, researchers will significantly enhance their analytical capabilities. They will be able to conduct rigorous and insightful regression analyses, exploring the relationships between variables and making predictions based on their data. This enhanced analytical capability will enable researchers to generate valuable insights, support evidence-based decision-making, and contribute to advancements in their respective fields.
Implementing regression models using R, Python, and Stata can streamline the data analysis process. Researchers will learn how to leverage the functionalities of these software tools, including data preprocessing, model estimation, diagnostic checks, and visualization. This streamlining of data analysis processes will enable researchers to efficiently and effectively analyze their data, saving time and resources.
By providing researchers with a comprehensive toolkit of regression models and practical implementation skills, this tutorial paper empowers researchers to advance their research methodology. Researchers will have a broader range of regression techniques at their disposal, allowing them to explore complex relationships and uncover new insights. This empowerment in research methodology will contribute to the quality and rigor of research conducted across various disciplines.
Overall, this tutorial paper strives to equip researchers with the knowledge, skills, and tools necessary to conduct robust regression analyses. By achieving these outcomes, researchers will be well-equipped to make informed decisions, maximize the insights derived from their data, and drive impactful research outcomes.
In this tutorial paper, we aim to provide a comprehensive overview of regression models categorized by the scale of measurement. The paper begins with an introduction that establishes the background and motivation for exploring regression models in relation to different scales of measurement. The objectives of the tutorial paper are then outlined to guide the reader.
The subsequent sections delve into specific regression models tailored to different scales of measurement. Starting with regression models for continuous variables, we explore techniques such as linear regression, polynomial regression, and spline regression. Practical implementation examples using popular software tools like R, Python, and Stata are provided to enhance understanding and facilitate hands-on application.
Moving on, the paper focuses on regression models for count or discrete variables, including Poisson regression, negative binomial regression, and zero-inflated models. Similar to the previous section, practical implementation examples using R, Python, and Stata are presented to illustrate the application of these models.
The tutorial paper then transitions to regression models for binary variables, introducing logistic regression, probit regression, and other relevant models. Code snippets and examples using R, Python, and Stata are provided to guide researchers in implementing these models effectively.
The subsequent section addresses regression models for multinomial variables, such as multinomial logistic regression. Practical implementation examples using R, Python, and Stata are included to demonstrate the application of these models to real-world scenarios.
Lastly, the paper covers regression models for ordinal variables, focusing on techniques like ordered logistic regression. Practical implementation examples using R, Python, and Stata are provided to assist researchers in applying these models accurately.
Following the detailed exploration of each category of regression models, a discussion and comparison section is included. This segment highlights the strengths and limitations of each model, factors to consider when selecting a model, and includes comparisons between models for different scales of measurement.
The tutorial paper concludes with a summary of the key points covered, emphasizing the importance of regression models for different scales of measurement and suggesting potential implications for research and future directions. Finally, a reference section is included to provide readers with the sources used throughout the tutorial paper.
Regression models are statistical techniques that are used to model the relationship between a dependent variable and one or more independent variables. In this introduction, we will discuss several regression models that are commonly used for continuous variable responses.
Multiple linear regression: This is a basic regression model that models the relationship between a dependent variable and multiple independent variables. It assumes that the relationship between the dependent variable and each independent variable is linear and additive.
Polynomial regression: This model extends multiple linear regression by allowing the relationship between the dependent variable and independent variable to be nonlinear. It uses higher-order polynomial terms to capture the nonlinear relationship.
Ridge regression: This is a regularized regression model that is used when there is multicollinearity among the independent variables. It adds a penalty term to the sum of squared errors to shrink the regression coefficients towards zero.
LASSO regression: This is another regularized regression model that is used for feature selection. It adds a penalty term to the absolute values of the regression coefficients to force some of them to be exactly zero.
Robust regression: This model is used when the data has outliers or non-normal errors. It uses robust estimation techniques to reduce the influence of outliers on the regression estimates.
Quantile regression: This model is used when the distribution of the dependent variable is not normal. It estimates the conditional quantiles of the dependent variable instead of the conditional mean.
Elastic net regression: This is a combination of ridge and LASSO regression. It adds both L1 and L2 penalties to the regression coefficients to achieve both feature selection and shrinkage.
Support vector regression: This is a nonlinear regression model based on support vector machines. Rather than separating classes, it finds a function that deviates from the observed responses by no more than a specified tolerance (epsilon) while remaining as flat as possible, and it is commonly used for data with nonlinear relationships via kernel functions.
Principal component regression: This model is used when there are many independent variables that are highly correlated. It uses principal components analysis to reduce the dimensionality of the data and then performs multiple linear regression on the reduced dataset.
Partial least squares regression: This model is similar to principal component regression but reduces the dimensionality differently: it finds a set of latent variables that are linear combinations of the original independent variables, chosen to maximize covariance with the dependent variable, and then performs regression on the reduced dataset.
Each of these regression models has its own strengths and weaknesses and is appropriate for different types of data and research questions. It is important to carefully select the appropriate regression model for your specific analysis to ensure accurate and reliable results.
The first of these models is multiple linear regression, a commonly used statistical technique for modeling the relationship between a dependent variable and multiple independent variables. It is a linear model that assumes a linear and additive relationship between the dependent variable and each independent variable.
The multiple linear regression model can be expressed as:
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]
where \(y\) is the dependent variable, \(x_1, x_2, \ldots, x_p\) are the independent variables, \(\beta_0, \beta_1, \ldots, \beta_p\) are the regression coefficients, and \(\epsilon\) is the error term.
The regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) represent the change in the dependent variable \(y\) associated with a one-unit change in the corresponding independent variable \(x_1, x_2, \ldots, x_p\). The error term \(\epsilon\) represents the unobserved factors that affect the dependent variable and are not explained by the independent variables.
The goal of multiple linear regression is to estimate the regression coefficients that best fit the data. This is typically done using the method of least squares, which minimizes the sum of squared errors between the observed values of the dependent variable and the predicted values from the regression model.
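As a minimal numerical sketch of least squares (using the built-in mtcars data, with lm() as a cross-check; the chosen predictors are illustrative), the coefficients can be computed directly from the normal equations:
# Ordinary least squares via the normal equations: beta = (X'X)^(-1) X'y
data(mtcars)
X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))  # design matrix with an intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
# The same estimates from lm()
coef(lm(mpg ~ wt + hp, data = mtcars))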
Multiple linear regression is a versatile technique that can be used for a wide range of research questions. It is commonly used in the social sciences, economics, engineering, and many other fields to model relationships between variables. However, it rests on assumptions such as linearity, independence, and normality of the errors. If these assumptions are violated, alternative regression models, such as those introduced above, may be more appropriate.
Before using multiple linear regression (MLR), it is important to check whether the assumptions of the model are satisfied. The assumptions of MLR are:
Linearity: The relationship between the dependent variable and the independent variables should be linear. This means that the relationship between the dependent variable and each independent variable should be a straight line in a scatter plot.
Independence: The observations should be independent of each other. This means that the value of the dependent variable for one observation should not be related to the value of the dependent variable for another observation.
Homoscedasticity: The variance of the errors should be constant across all levels of the independent variables. This means that the spread of the residuals should be the same regardless of the value of the independent variables.
Normality: The errors should be normally distributed. This means that the distribution of the residuals should be a bell-shaped curve with a mean of zero.
No multicollinearity: The independent variables should not be highly correlated with each other. Multicollinearity can affect the stability and reliability of the estimates of the regression coefficients.
No influential outliers: The presence of influential outliers can distort the regression line and affect the estimates of the regression coefficients. Therefore, it is important to check for influential outliers using diagnostic plots and outlier detection techniques.
It is important to note that violating these assumptions can lead to biased or inefficient estimates of the regression coefficients and affect the statistical inference and predictions. Therefore, it is recommended to check these assumptions before interpreting the results of an MLR model.
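A minimal sketch of how these assumptions might be checked in R for a fitted model (the model, tests, and cutoffs below are illustrative choices, not the only options):
# Fit an example model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
# Linearity and homoscedasticity: residuals-vs-fitted and scale-location plots
plot(fit, which = c(1, 3))
# Normality of the errors: Shapiro-Wilk test on the residuals
shapiro.test(residuals(fit))
# Homoscedasticity: Breusch-Pagan test from the lmtest package
library(lmtest)
bptest(fit)
# Multicollinearity: variance inflation factors from the car package (VIF > 10 is a common warning sign)
library(car)
vif(fit)
# Influential outliers: Cook's distance (values above 4/n are often flagged)
which(cooks.distance(fit) > 4 / nrow(mtcars))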
Here are some alternative regression models that can be used when the assumptions of multiple linear regression are violated:
Polynomial regression model: Polynomial regression becomes an alternative to multiple linear regression when the relationship between the dependent and independent variables is non-linear but can be approximated by a polynomial function, capturing curvature that a purely linear model cannot.
Ridge regression model: Ridge regression becomes an alternative when the model suffers from multicollinearity, that is, when two or more independent variables are highly correlated. Adding a penalty term to the regression equation helps reduce the impact of multicollinearity.
LASSO regression model: LASSO regression also addresses multicollinearity, but it uses a different penalty term that can shrink some coefficients exactly to zero. It is therefore an attractive alternative when the researcher is interested in identifying the most important predictors.
Robust regression model: Robust regression becomes an alternative when outliers in the data could have a significant impact on the results, since its estimates are far less influenced by outliers than ordinary least squares estimates.
Quantile regression model: Quantile regression becomes an alternative when the relationship between the dependent and independent variables differs across levels (quantiles) of the dependent variable.
Elastic net regression model: Elastic net regression becomes an alternative when there are many independent variables and some of them are highly correlated. It combines the penalty terms of ridge and LASSO regression to reduce the impact of multicollinearity while identifying the most important predictors.
Support vector regression model: Support vector regression becomes an alternative when the relationship is non-linear and cannot be well approximated by a polynomial function. It uses a kernel function to transform the data into a higher-dimensional space where a linear relationship can be found.
Principal component regression model: Principal component regression becomes an alternative when many highly correlated independent variables can be combined into a smaller number of uncorrelated variables. It uses principal component analysis to perform this reduction before fitting the regression.
Partial least squares regression model: Partial least squares regression likewise combines many correlated predictors into a smaller number of components, but it constructs those components to maximize their covariance with the dependent variable rather than only the variance of the predictors.
It is important to choose the appropriate regression model based on the specific characteristics of the dataset and the research question at hand. Examples of all of the above models, implemented in R, are provided in the following sections.
Multiple linear regression. Strengths: simple to understand and interpret; widely applicable to a range of data sets and problems; can identify relationships between multiple independent variables and a dependent variable.
Limitations: assumes a linear relationship between the variables; can be sensitive to outliers and multicollinearity; does not perform well with high-dimensional data.
Ridge regression. Strengths: helps to prevent overfitting in high-dimensional data sets; provides regularization; shrinks coefficients while controlling for multicollinearity.
Limitations: assumes a linear relationship between the variables; may not perform well when a large number of independent variables are not strongly related to the dependent variable.
Quantile regression. Strengths: can model the relationship between variables across different quantiles; useful when the relationship is not uniform across the distribution; can identify the impact of different predictors on different parts of the distribution.
Limitations: requires a large sample size; can be computationally intensive; may not perform well with many independent variables.
LASSO regression. Strengths: can be used for variable selection and regularization; can identify the most important predictors while controlling for multicollinearity; performs well in high-dimensional data sets.
Limitations: assumes a linear relationship between the variables; can be sensitive to outliers; may not perform well when a large number of independent variables are not strongly related to the dependent variable.
Polynomial regression. Strengths: can model nonlinear relationships between variables; useful when the relationship between variables is not linear.
Limitations: can be sensitive to outliers; may overfit the data if the degree of the polynomial is too high.
Principal component regression. Strengths: can reduce the dimensionality of the data set; provides regularization; performs well with many independent variables.
Limitations: assumes a linear relationship between the variables; can be sensitive to outliers; may not perform well when the most important predictors are not strongly related to the principal components.
Partial least squares regression. Strengths: can reduce the dimensionality of the data set; provides regularization; performs well with many independent variables.
Limitations: assumes a linear relationship between the variables; can be sensitive to outliers; may not perform well when the most important predictors are not strongly related to the latent variables.
The following worked example illustrates how these regression techniques can be matched to a real research problem:
Question: You are conducting a study to predict sales of a product based on advertising expenditure and other factors such as price, product features, and competitor advertising. Which regression model(s) would be most appropriate for this study and why?
Answer: The most appropriate regression model for this study would be multiple linear regression. This is because we have a continuous dependent variable (sales) and multiple independent variables (advertising expenditure, price, product features, and competitor advertising) that are expected to have a linear and additive relationship with the dependent variable. Multiple linear regression would allow us to estimate the coefficients of each independent variable and determine their impact on the dependent variable.
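As a sketch, assuming a hypothetical data frame sales_data with columns sales, advertising, price, features, and competitor_adv (all of these names are placeholders, not variables from a real dataset):
# Multiple linear regression of sales on the hypothesized drivers
model <- lm(sales ~ advertising + price + features + competitor_adv, data = sales_data)
summary(model)  # each coefficient estimates the change in sales per one-unit change in that predictor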
Multiple linear regression is a good starting point since it is a basic model that assumes a linear and additive relationship between the dependent variable and independent variables. However, if the assumptions of multiple linear regression are violated, alternative regression models may be more appropriate. Here are some possible alternatives:
Polynomial regression: This model extends multiple linear regression by allowing for a nonlinear relationship between the dependent variable and independent variables using higher-order polynomial terms.
Ridge regression: This is a regularized regression model that is used when there is multicollinearity among the independent variables. It adds a penalty term to the sum of squared errors to shrink the regression coefficients towards zero.
LASSO regression: This is another regularized regression model that is used for feature selection. It adds a penalty term to the absolute values of the regression coefficients to force some of them to be exactly zero.
Robust regression: This model is used when the data has outliers or non-normal errors. It uses robust estimation techniques to reduce the influence of outliers on the regression estimates.
Quantile regression: This model is used when the distribution of the dependent variable is not normal. It estimates the conditional quantiles of the dependent variable instead of the conditional mean.
Elastic net regression: This is a combination of ridge and LASSO regression. It adds both L1 and L2 penalties to the regression coefficients to achieve both feature selection and shrinkage.
Support vector regression: This is a nonlinear regression model based on support vector machines. It finds a function that deviates from the observed responses by no more than a specified tolerance (epsilon), and it is commonly used for data with nonlinear relationships.
Principal component regression: This model is used when there are many independent variables that are highly correlated. It uses principal components analysis to reduce the dimensionality of the data and then performs multiple linear regression on the reduced dataset.
Partial least squares regression: This model is similar to principal component regression but reduces the dimensionality differently: it finds a set of latent variables that are linear combinations of the original independent variables, chosen to maximize covariance with the dependent variable, and then performs regression on the reduced dataset.
The appropriate regression model for predicting sales of a product would depend on the specific characteristics of the data, the research question, and the goals of the study.
Examples of multiple linear, polynomial, ridge, LASSO, robust, quantile, elastic net, support vector, principal component, and partial least squares regression using R
Here are some examples of how to use various regression techniques in R:
# Load the dataset
data(mtcars)
# Fit a multiple linear regression model
model <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## hp -0.02054 0.01215 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
# Load the dataset
data(cars)
# Fit a polynomial regression model
model <- lm(dist ~ poly(speed, 3), data = cars)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = dist ~ poly(speed, 3), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.670 -9.601 -2.231 7.075 44.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.98 2.15 19.988 < 2e-16 ***
## poly(speed, 3)1 145.55 15.21 9.573 1.6e-12 ***
## poly(speed, 3)2 23.00 15.21 1.512 0.137
## poly(speed, 3)3 13.80 15.21 0.907 0.369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 46 degrees of freedom
## Multiple R-squared: 0.6732, Adjusted R-squared: 0.6519
## F-statistic: 31.58 on 3 and 46 DF, p-value: 3.074e-11
# Load the dataset
data(mtcars)
# Fit a ridge regression model
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[,-1]
y <- mtcars$mpg
ridge_model <- glmnet(x, y, alpha = 0, lambda = 0.1)
# Print the coefficients of the model
coef(ridge_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 15.064606098
## cyl -0.135092887
## disp 0.005350088
## hp -0.016522992
## drat 0.884425934
## wt -2.820054931
## qsec 0.595278789
## vs 0.341380136
## am 2.386212242
## gear 0.694835162
## carb -0.484540729
# Load the dataset
data(mtcars)
# Fit a LASSO regression model
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[,-1]
y <- mtcars$mpg
lasso_model <- glmnet(x, y, alpha = 1, lambda = 0.1)
# Print the coefficients of the model
coef(lasso_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 19.84148290
## cyl -0.20127426
## disp .
## hp -0.01305361
## drat 0.77624761
## wt -2.64612307
## qsec 0.46844940
## vs 0.13056381
## am 2.12332632
## gear 0.31394015
## carb -0.46502031
# Load the dataset
data(mtcars)
# Fit a robust regression model
library(MASS)
model <- rlm(mpg ~ wt + hp, data = mtcars)
# Print the summary of the model
summary(model)
##
## Call: rlm(formula = mpg ~ wt + hp, data = mtcars)
## Residuals:
## Min 1Q Median 3Q Max
## -3.6639 -1.3057 0.1727 1.3162 6.3392
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 36.5840 1.4380 25.4407
## wt -3.8801 0.5691 -6.8180
## hp -0.0293 0.0081 -3.6050
##
## Residual standard error: 2.006 on 29 degrees of freedom
# Load the dataset
data(mtcars)
# Fit a quantile regression model
library(quantreg)
model <- rq(mpg ~ wt + hp, data = mtcars, tau = 0.5)
# Print the summary of the model
summary(model)
##
## Call: rq(formula = mpg ~ wt + hp, tau = 0.5, data = mtcars)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 36.62601 31.41282 38.90949
## wt -3.60570 -5.91208 -2.81272
## hp -0.03559 -0.04981 -0.01885
# Load the dataset
data(mtcars)
# Fit an elastic net regression model
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[,-1]
y <- mtcars$mpg
elastic_model <- glmnet(x, y, alpha = 0.5, lambda = 0.1)
# Print the coefficients of the model
coef(elastic_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 17.33886538
## cyl -0.11423932
## disp .
## hp -0.01317787
## drat 0.85451871
## wt -2.52613117
## qsec 0.50749464
## vs 0.20701945
## am 2.25592940
## gear 0.52586364
## carb -0.55423993
# Load the dataset
data(mtcars)
# Fit a support vector regression model
library(e1071)
model <- svm(mpg ~ wt + hp, data = mtcars, kernel = "radial")
# Print the summary of the model
summary(model)
##
## Call:
## svm(formula = mpg ~ wt + hp, data = mtcars, kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
## epsilon: 0.1
##
##
## Number of Support Vectors: 28
# Load the dataset
data(mtcars)
# Fit a principal component regression model
library(pls)
model <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")
# Print the summary of the model
summary(model)
## Data: X dimension: 32 10
## Y dimension: 32 1
## Fit method: svdpc
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 6.123 2.815 2.859 2.643 2.690 2.758 2.804
## adjCV 6.123 2.798 2.839 2.622 2.669 2.732 2.774
## 7 comps 8 comps 9 comps 10 comps
## CV 2.903 2.932 3.320 3.336
## adjCV 2.869 2.893 3.259 3.271
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 57.60 84.10 90.07 92.77 94.99 97.09 98.42 99.23
## mpg 82.53 82.63 85.40 85.41 85.47 85.56 85.58 85.85
## 9 comps 10 comps
## X 99.76 100.0
## mpg 86.09 86.9
# Load the dataset
data(mtcars)
# Fit a partial least squares regression model
library(pls)
model <- plsr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")
# Print the summary of the model
summary(model)
## Data: X dimension: 32 10
## Y dimension: 32 1
## Fit method: kernelpls
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 6.123 2.757 2.764 2.798 3.327 3.441 3.531
## adjCV 6.123 2.741 2.741 2.772 3.263 3.371 3.453
## 7 comps 8 comps 9 comps 10 comps
## CV 3.531 3.528 3.526 3.528
## adjCV 3.454 3.451 3.449 3.451
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 57.58 67.65 89.94 91.48 92.82 94.00 95.86 97.14
## mpg 83.18 85.39 85.75 86.38 86.72 86.85 86.89 86.90
## 9 comps 10 comps
## X 98.51 100.0
## mpg 86.90 86.9
After fitting a regression model, it is important to check its adequacy for the given data. Model adequacy checking involves assessing the goodness of fit, checking for violations of the model assumptions, and identifying influential observations. This process is important for selecting the most appropriate regression model and for ensuring that the results obtained from it are reliable.
Goodness of fit refers to how well the model fits the data. One way to assess the goodness of fit is to use the R-squared value, which measures the proportion of variability in the dependent variable that is explained by the independent variables. A high R-squared value indicates a good fit, but it should be noted that a high R-squared value does not necessarily mean that the model is appropriate for the given data. Other measures of goodness of fit include adjusted R-squared, which takes into account the number of variables in the model, and residual standard error, which measures the average distance between the predicted values and the actual values.
Regression models are based on certain assumptions, such as linearity, independence, normality, and constant variance of errors. Violations of these assumptions can lead to biased or unreliable results. Therefore, it is important to check for violations of model assumptions and take appropriate measures to address them. This can be done by examining the residual plots, which show the differences between the predicted values and the actual values. If the residual plot shows a pattern, such as a curve or a funnel shape, this indicates a violation of one or more model assumptions. In such cases, transformations or other adjustments to the model may be necessary.
Influential observations are observations that have a large effect on the regression model. These observations can have a disproportionate impact on the model’s coefficients and predictions, and can lead to biased or unreliable results. Therefore, it is important to identify influential observations and assess their impact on the model. This can be done by examining the Cook’s distance and leverage values, which measure the influence of each observation on the model. Observations with high Cook’s distance or leverage values should be further examined to determine their impact on the model.
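A brief sketch of these checks in R for a fitted lm() object (the model and the 4/n cutoff are illustrative conventions):
fit <- lm(mpg ~ wt + hp, data = mtcars)
# Goodness of fit: R-squared, adjusted R-squared, and residual standard error
summary(fit)$r.squared
summary(fit)$adj.r.squared
summary(fit)$sigma
# Influential observations: Cook's distance and leverage (hat) values
cd <- cooks.distance(fit)
lev <- hatvalues(fit)
which(cd > 4 / nobs(fit))  # a common rule of thumb for flagging influential points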
In addition to the measures of goodness of fit, there are several other metrics that can be used for model adequacy checking in regression analysis. These include:
RMSE measures the standard deviation of the residuals, which represents the average distance between the predicted values and the actual values. \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \] where \(y_i\) is the actual value of the dependent variable, \(\hat{y}_i\) is the predicted value of the dependent variable, and \(n\) is the number of observations.
MAE measures the average absolute difference between the predicted values and the actual values.
\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]
with \(y_i\), \(\hat{y}_i\), and \(n\) defined as above.
MSPE measures the average squared difference between the predicted values and the actual values.
\[\text{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
with \(y_i\), \(\hat{y}_i\), and \(n\) defined as above.
These metrics can provide additional information about the adequacy of the regression model and can help identify areas where the model may need to be improved.
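All three metrics are straightforward to compute directly from the formulas above; a small sketch in R (the function names are our own):
# Simple implementations of the three metrics (actual = y, predicted = y-hat)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
mspe <- function(actual, predicted) mean((actual - predicted)^2)
# Example with in-sample fitted values from a model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
rmse(mtcars$mpg, fitted(fit))
mae(mtcars$mpg, fitted(fit))
mspe(mtcars$mpg, fitted(fit))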
Model adequacy checking is an important step in regression analysis. By assessing the goodness of fit, checking for violations of model assumptions, and identifying influential observations, we can ensure that the regression model is appropriate for the given data and that the results obtained from the model are reliable. In addition to model adequacy checking, it is also important to compare different regression models using information criteria, such as AIC, BIC, and HQIC, to select the most appropriate model for the given data. By combining model adequacy checking with model comparison and the use of appropriate metrics, we can improve the accuracy and reliability of the regression analysis.
Here are some examples of how to check the adequacy of various regression models in R:
# Load the necessary libraries
library(tidyverse)
library(caret)
library(glmnet)
library(e1071)
library(pls)
# Set the seed for reproducibility
set.seed(123)
# Load the data
data(mtcars)
# Split the data into training and test sets
train_index <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train <- mtcars[train_index, ]
test <- mtcars[-train_index, ]
# Fit the multiple linear regression model
mlr <- lm(mpg ~ ., data = train)
# Fit the polynomial regression model
poly <- lm(mpg ~ poly(wt, 3), data = train)
# Fit the ridge regression model
ridge <- glmnet(as.matrix(train[, -1]), train$mpg, alpha = 0, lambda = 0.1)
# Fit the LASSO regression model
lasso <- glmnet(as.matrix(train[, -1]), train$mpg, alpha = 1, lambda = 0.1)
# Fit the robust regression model
library(MASS)
robust <- rlm(mpg ~ ., data = train)
# Fit the quantile regression model
library(quantreg)
quantile <- rq(mpg ~ ., data = train)
# Fit the elastic net regression model
enet <- glmnet(as.matrix(train[, -1]), train$mpg, alpha = 0.5, lambda = 0.1)
# Fit the support vector regression model
svm <- svm(mpg ~ ., data = train)
# Fit the principal component regression model using the pls package, so that
# test observations are projected onto the components estimated from the training data
pcr_model <- pcr(mpg ~ ., data = train, scale = TRUE, ncomp = 5)
# Fit the partial least squares regression model
pls <- plsr(mpg ~ ., data = train, ncomp = 5)
# Make predictions on the test set
pred_mlr <- predict(mlr, newdata = test)
pred_poly <- predict(poly, newdata = test)
pred_ridge <- predict(ridge, newx = as.matrix(test[, -1]))
pred_lasso <- predict(lasso, newx = as.matrix(test[, -1]))
pred_robust <- predict(robust, newdata = test)
pred_quantile <- predict(quantile, newdata = test)
pred_enet <- predict(enet, newx = as.matrix(test[, -1]))
pred_svm <- predict(svm, newdata = test)
pred_pcr <- as.numeric(predict(pcr_model, newdata = test, ncomp = 5))
pred_pls <- as.numeric(predict(pls, newdata = test, ncomp = 5))
# Calculate the RMSE, MAE, and MSPE for each model
rmse <- c(RMSE(pred_mlr, test$mpg), RMSE(pred_poly, test$mpg), RMSE(pred_ridge, test$mpg),
RMSE(pred_lasso, test$mpg), RMSE(pred_robust, test$mpg), RMSE(pred_quantile, test$mpg),
RMSE(pred_enet, test$mpg), RMSE(pred_svm, test$mpg), RMSE(pred_pcr, test$mpg),
RMSE(pred_pls, test$mpg))
mae <- c(MAE(pred_mlr, test$mpg), MAE(pred_poly, test$mpg), MAE(pred_ridge, test$mpg),
MAE(pred_lasso, test$mpg), MAE(pred_robust, test$mpg), MAE(pred_quantile, test$mpg),
MAE(pred_enet, test$mpg), MAE(pred_svm, test$mpg), MAE(pred_pcr, test$mpg),
MAE(pred_pls, test$mpg))
mspe <- c(mean((pred_mlr - test$mpg)^2), mean((pred_poly - test$mpg)^2),
mean((pred_ridge - test$mpg)^2), mean((pred_lasso - test$mpg)^2),
mean((pred_robust - test$mpg)^2), mean((pred_quantile - test$mpg)^2),
mean((pred_enet - test$mpg)^2), mean((pred_svm - test$mpg)^2),
mean((pred_pcr - test$mpg)^2), mean((pred_pls - test$mpg)^2))
# Print the results
models <- c("Multiple Linear", "Polynomial", "Ridge", "LASSO", "Robust", "Quantile", "Elastic Net", "Support Vector", "Principal Component", "Partial Least Squares")
results <- data.frame(Model = models, RMSE = rmse, MAE = mae, MSPE = mspe)
print(results)
## Model RMSE MAE MSPE
## 1 Multiple Linear 4.808981 3.384844 23.12630
## 2 Polynomial 3.662522 3.193003 13.41407
## 3 Ridge 4.373602 3.094265 19.12840
## 4 LASSO 4.139850 3.042097 17.13835
## 5 Robust 4.689033 3.455227 21.98703
## 6 Quantile 4.579320 3.467588 20.97017
## 7 Elastic Net 4.114031 2.984943 16.92525
## 8 Support Vector 3.699094 2.582040 13.68329
## 9 Principal Component 10.092617 7.995218 101.86092
## 10 Partial Least Squares 3.917685 2.877246 15.34826
In this example, we first load the necessary libraries and the mtcars dataset. We split the data into training and test sets and fit ten different regression models: multiple linear, polynomial, ridge, LASSO, robust, quantile, elastic net, support vector, principal component, and partial least squares regression. We then use these models to make predictions on the test data and calculate the RMSE and MAE using the RMSE() and MAE() functions from the caret package, and the MSPE using the base mean() function. Finally, we print the results in a table.
Note that this example uses the built-in mtcars dataset and assumes that the necessary packages are installed and loaded.
In summary, using RMSE, MAE, and MSPE to compare multiple linear, polynomial, ridge, LASSO, robust, quantile, elastic net, support vector, principal component, and partial least squares regressions can provide a useful way to evaluate the accuracy and reliability of the models in predicting the response variable. However, it is important to consider the specific context and goals of the analysis when interpreting the results and selecting the most appropriate model.
When fitting a regression model, it is often necessary to compare different models and choose the most appropriate one for the given data. One way to compare regression models is to use information criteria, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Hannan-Quinn Information Criterion (HQIC). These criteria balance the trade-off between the goodness of fit and the complexity of the model, penalizing models with more parameters and favoring models that fit the data well with fewer parameters.
The Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. It is based on the idea of minimizing the Kullback-Leibler divergence between the true underlying distribution and the model's estimated distribution. The formula for AIC is:
\[AIC = -2\log(L) + 2p\]
where \(L\) is the likelihood function of the model and \(p\) is the number of parameters in the model. The AIC value is twice the negative log-likelihood plus twice the number of parameters, so a lower AIC value indicates a better trade-off between model fit and complexity.
The Bayesian Information Criterion (BIC) is a similar criterion to AIC but with a stronger penalty for model complexity. The formula for BIC is:
\[BIC = -2\log(L) + p\log(n)\]
where \(n\) is the sample size. Compared to AIC, BIC has a larger penalty for models with more parameters, which can prevent overfitting. Therefore, a lower BIC value indicates a better balance between model fit and complexity.
The Hannan-Quinn Information Criterion (HQIC) is another information criterion that takes into account the sample size and the number of parameters in the model. The formula for HQIC is:
\[HQIC = -2\log(L) + 2p\log(\log(n))\]
HQIC has a smaller penalty for model complexity compared to BIC but a larger penalty compared to AIC. Therefore, it provides a middle ground between AIC and BIC.
To compare different regression models using AIC, BIC, and HQIC, we calculate the values of these criteria for each model and compare them. The model with the lowest value of the criterion is preferred. However, the choice of information criterion should not be the only factor in choosing the best model; other factors, such as the interpretability and practical usefulness of the models, should also be considered.
In conclusion, AIC, BIC, and HQIC are useful tools for comparing regression models and selecting the most appropriate model for a given set of data. By balancing the trade-off between model fit and complexity, these criteria can help prevent overfitting and improve the generalizability of the model.
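R reports AIC and BIC directly through the AIC() and BIC() functions, but it has no built-in HQIC; the small helper below follows the formula above and is our own sketch, applicable to any fitted model with logLik() and nobs() methods:
# HQIC = -2*log(L) + 2*p*log(log(n)), computed from a fitted model object
hqic <- function(fit) {
  ll <- logLik(fit)
  p <- attr(ll, "df")   # number of estimated parameters
  n <- nobs(fit)        # sample size
  -2 * as.numeric(ll) + 2 * p * log(log(n))
}
# Example: hqic(lm(mpg ~ wt + hp, data = mtcars))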
Comparing models is an important step in the regression analysis process. Here are some examples of how to compare the performance of different regression models in R:
# Multiple linear regression
lm.fit <- lm(mpg ~ wt + disp + hp, data = train)
lm.aic <- AIC(lm.fit)
lm.bic <- BIC(lm.fit)
# Polynomial regression (degree 2)
poly.fit <- lm(mpg ~ poly(wt, 2) + poly(disp, 2) + poly(hp, 2), data = train)
poly.aic <- AIC(poly.fit)
poly.bic <- BIC(poly.fit)
# Ridge regression
library(lmridge)
# K is the ridge biasing (penalty) parameter; 0.1 is an illustrative value
ridge.fit <- lmridge(mpg ~ wt + disp + hp, data = train, K = 0.1)
ridge.aic <- infocr(ridge.fit)[, 1]
ridge.bic <- infocr(ridge.fit)[, 2]
# LASSO regression
lasso.fit <- glmnet(as.matrix(train[, -1]), train[, 1], alpha = 1)
fit <- lasso.fit
# tLL is (approximately) twice the log-likelihood gain over the null model;
# the values below are relative AIC/BIC measures for each lambda on the path
tLL <- fit$nulldev - deviance(fit)
k <- fit$df
n <- fit$nobs
lasso.aic <- -tLL + 2*k
lasso.bic <- log(n)*k - tLL
# Robust regression
library(MASS)
huber.fit <- rlm(mpg ~ wt + disp + hp, data = train, psi = psi.huber)
huber.aic <- AIC(huber.fit)
huber.bic <- BIC(huber.fit)
# Quantile regression
library(quantreg)
quantile.fit <- rq(mpg ~ wt + disp + hp, data = train, tau = 0.5)
quantile.aic <- AIC(quantile.fit)
quantile.bic <- BIC(quantile.fit)
# Elastic net regression
enet.fit <- glmnet(as.matrix(train[, -1]), train[, 1], alpha = 0.5)
fit <- enet.fit
tLL <- fit$nulldev - deviance(fit)
k <- fit$df
n <- fit$nobs
enet.aic <- -tLL + 2*k
enet.bic <- log(n)*k - tLL
Finally, we can compare the AIC and BIC values for each model in a table:
results <- cbind(Model = c("Multiple Linear", "Polynomial", "Robust", "Quantile"), AIC = c(lm.aic, poly.aic, huber.aic, quantile.aic), BIC = c(lm.bic, poly.bic, huber.bic, quantile.bic))
results
## Model AIC BIC
## [1,] "Multiple Linear" "136.875442424734" "143.53646497561"
## [2,] "Polynomial" "131.856202586974" "142.513838668375"
## [3,] "Robust" "137.107253946136" "143.768276497012"
## [4,] "Quantile" "134.893047763808" NA
Based on the AIC and BIC values, the model with the lowest values of the information criteria is preferred; in this comparison, the polynomial model has the lowest AIC and BIC among the models listed. However, we should also consider other factors, such as the interpretability and practical usefulness of the models, before making a final decision.
This module covers regression models for discrete variable responses, with a focus on count data. We explore ten count regression models that can be used to analyze data with discrete, non-negative count outcomes, including Poisson regression, negative binomial regression, and zero-inflated and hurdle models.
Each of these models has its own strengths and weaknesses, and is appropriate for different types of count data. For example, Poisson regression is often used when the mean and variance of the count data are approximately equal, while negative binomial regression is useful when there is overdispersion in the data (i.e., the variance is greater than the mean). The zero-inflated and hurdle models are designed to handle situations where there are excess zeros in the data, which can be caused by a variety of factors such as measurement error or the presence of a sub-population with a different underlying distribution.
Overall, this module provides a comprehensive overview of the different count regression models that are available for analyzing count data, and will be useful for anyone working with this type of data in their research or professional work.
Poisson regression is a type of regression analysis that is used to model count data. It is based on the Poisson distribution, which is a probability distribution that describes the number of times an event occurs in a fixed interval of time or space. Poisson regression is commonly used in fields such as epidemiology, biology, and economics, where the outcome variable is a count of events.
The Poisson regression model assumes that the dependent variable \(Y\) follows a Poisson distribution with a mean parameter \(\lambda\). The model assumes that the log of the mean parameter is a linear function of the independent variables \(X\), and is given by the following equation:
\[ \ln(\lambda) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \]
where \(\beta_0\) is the intercept, \(\beta_1, \beta_2, \ldots, \beta_k\) are the coefficients of the independent variables \(X_1, X_2, \ldots, X_k\), respectively.
The assumptions of the Poisson regression model are:
Independence: The observations should be independent of each other. This means that the occurrence of an event for one unit should not affect the occurrence of an event for another unit.
Count data: The dependent variable should be count data, which means that it should be a discrete variable that takes non-negative integer values.
Linearity: The relationship between the logarithm of the mean parameter and the independent variables should be linear. Equivalently, each one-unit change in an independent variable multiplies the expected count by a constant factor.
Homogeneity of variance: The variance of the dependent variable should be equal to its mean, which is the defining property of the Poisson distribution. This assumption is referred to as equidispersion. If the variance is greater than the mean (overdispersion) or less than the mean (underdispersion), then the Poisson regression model may not be appropriate; a quick empirical check is sketched after this list.
No multicollinearity: The independent variables should not be highly correlated with each other. If there is multicollinearity among the independent variables, then the coefficients of the independent variables may be unstable and difficult to interpret.
Lack of influential outliers: Outliers or extreme values in the independent variables or the dependent variable can affect the results of the Poisson regression model. Therefore, it is important to check for influential outliers and consider their impact on the results.
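As a quick check of the equidispersion assumption, the Pearson statistic divided by the residual degrees of freedom should be close to 1. The sketch below assumes a fitted Poisson glm object named poisson_model, like the one fitted later in this module; dispersiontest() comes from the AER package (assumed installed).
# Ratio of the Pearson chi-square statistic to the residual df;
# values well above 1 suggest overdispersion, well below 1 underdispersion
pearson <- sum(residuals(poisson_model, type = "pearson")^2)
pearson / df.residual(poisson_model)
# Formal score test for overdispersion
library(AER)
dispersiontest(poisson_model)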
Poisson regression can be applied to a wide range of real-life data, such as the number of deaths due to a particular disease, the number of accidents on a particular road, or the number of customers visiting a store on a particular day. For example, a Poisson regression model can be used to analyze the factors that affect the number of accidents on a particular road, such as traffic volume, weather conditions, and road design.
Although Poisson regression is a powerful statistical model for analyzing count data, it has some limitations that should be considered. Here are some of the limitations of the Poisson regression model:
Overdispersion: The Poisson regression model assumes that the variance of the dependent variable is equal to its mean, which is called equidispersion. However, in some cases, the variance may be larger than the mean, which is called overdispersion. Overdispersion can lead to biased parameter estimates and incorrect inferences.
Zero Inflation: Under a Poisson model, the expected proportion of zeros is determined entirely by the mean. In some cases, there may be an excess number of zeros in the data relative to this expectation, which is called zero inflation. Poisson regression may not be appropriate for such data, and alternative models such as zero-inflated Poisson regression or hurdle models may be more appropriate.
Nonlinear relationships: The Poisson regression model assumes that the relationship between the dependent variable and the independent variables is linear. If the relationship is nonlinear, then the Poisson regression model may not capture the true relationship between the variables.
Outliers: The Poisson regression model is sensitive to outliers, which can affect the estimates of the regression coefficients and the goodness-of-fit of the model.
Independence: The Poisson regression model assumes that the observations are independent of each other. However, in some cases, the observations may be correlated, which can violate the independence assumption of the model.
Model Selection: Poisson regression models can be complex, and there may be several models with different independent variables that can fit the data equally well. Choosing the best model can be challenging and requires careful consideration of the model assumptions, goodness-of-fit, and predictive accuracy.
Large counts: The Poisson regression model may not be appropriate for count data with very large counts, where overdispersion commonly arises and the equal mean-variance assumption becomes hard to sustain. In such cases, alternative models such as negative binomial regression may be more appropriate.
Continuous independent variables: Poisson regression readily accommodates continuous predictors, but it assumes their effect is linear on the log scale. If the true relationship is nonlinear, transformations of the independent variables or more flexible alternatives such as generalized additive models may be necessary to capture the true relationship.
Missing data: The Poisson regression model may not be appropriate for count data with missing values, as the missing values can affect the estimates of the regression coefficients and the goodness-of-fit of the model. In such cases, methods such as multiple imputation or maximum likelihood estimation may be used to handle missing data.
Model assumptions: The Poisson regression model assumes that the data follows a Poisson distribution and that the relationship between the dependent variable and the independent variables is linear. Violations of these assumptions can lead to biased parameter estimates and incorrect inferences. Therefore, it is important to assess the goodness-of-fit of the model and to consider alternative models if the assumptions are not met.
Zero truncation: Zero truncation occurs when the data contain only positive counts and there are no zeros. This can happen when the data are collected from a population that is guaranteed to produce at least one count. The Poisson regression model assigns positive probability to zero counts, and therefore may not be appropriate for zero-truncated data. In such cases, zero-truncated Poisson or zero-truncated negative binomial regression may be more appropriate.
Zero asymmetry: Zero asymmetry occurs when there are more zeros in the data than would be expected under a Poisson distribution. This can happen when the data contain a mixture of two populations, one with a high probability of producing zero counts and another with a Poisson distribution for the positive counts. Because the Poisson model ties the proportion of zeros to the mean, it may not be appropriate for zero-asymmetric data. In such cases, zero-inflated Poisson regression or hurdle models may be more appropriate.
In summary, while the Poisson regression model is a powerful tool for analyzing count data, it has several limitations that should be considered. Choosing the appropriate model requires careful consideration of the characteristics of the data and the assumptions of the model.
Zero-truncation: If the count data are zero-truncated (only positive counts can occur), zero-truncated Poisson regression (or zero-truncated negative binomial regression) is an alternative to Poisson regression. These models assume a Poisson (or negative binomial) distribution with the zero class removed.
Zero asymmetry: If the count data contain more zeros than would be expected under a Poisson distribution, then Poisson regression may not be appropriate. Geometric regression can be an alternative when the counts follow a geometric distribution, which places relatively more mass at zero and low counts.
Overdispersion: If the variance of the count data is greater than the mean, then the assumption of equidispersion in Poisson regression is violated. Negative binomial regression is often used as an alternative in such situations. However, if the count data follows a geometric distribution, then geometric regression may be a suitable alternative as it can handle overdispersion.
Underdispersion: If the variance of the count data is less than the mean, then the assumption of equidispersion in Poisson regression is again violated. Quasi-Poisson regression, which estimates the dispersion parameter from the data rather than fixing it at 1, is often used as an alternative in such situations.
Zero Inflation: Zero-inflated Poisson regression is an alternative to Poisson regression when there is zero inflation in the data, meaning that there are more zeros than expected under a Poisson distribution. This model is a mixture of two distributions, a Poisson distribution for the non-zero counts and a point mass at zero for the excess zeros.
Zero Inflation: Hurdle models are another alternative to Poisson regression when there is zero inflation in the data. This model is a two-part model, with a logistic regression model predicting the probability of a count being zero (the hurdle) and a truncated Poisson or negative binomial model for the positive counts.
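When it is unclear whether a zero-inflated extension is warranted, the non-nested fits can be compared directly. Below is a minimal sketch, assuming Poisson and ZIP fits named poisson_model and zip_model as created later in this module, using the Vuong test from the pscl package:
# Vuong closeness test between the non-nested Poisson and ZIP fits;
# a significantly positive statistic favours the first model
library(pscl)
vuong(poisson_model, zip_model)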
The Poisson regression model is used when the mean and variance of the count data are approximately equal. The Poisson regression model assumes that the count data follow a Poisson distribution and models the expected count as a function of one or more predictor variables.
Example: A researcher wants to model the number of traffic accidents per day on a busy road. The data show that the number of accidents per day is relatively stable over time and across different weather conditions.
The Negative Binomial regression model is used when the count data have overdispersion, meaning that the variance is greater than the mean. The Negative Binomial regression model assumes that the count data follow a Negative Binomial distribution and models the expected count as a function of one or more predictor variables.
Example: A researcher wants to model the number of books read per month by a book club. The data show that some members read many more books than others, resulting in overdispersion.
The Geometric regression model is used when the counts follow a monotonically decreasing distribution, with each successive count less likely than the one before it. The Geometric regression model assumes that the count data follow a Geometric distribution and models the expected count as a function of one or more predictor variables.
Example: A researcher wants to model the number of attempts it takes for a student to pass a difficult exam. The data show that some students pass on their first attempt, but most require multiple attempts.
The Quasi-Poisson regression model is used when the count data have underdispersion (or overdispersion) and the Negative Binomial regression model cannot be used due to lack of convergence or other issues. The Quasi-Poisson regression model keeps the Poisson mean structure but allows the variance to be a multiple of the mean, with the dispersion factor estimated from the data.
Example: A researcher wants to model the number of visits per day to a website. The data show a lot of variability in the number of visits per day, but the Poisson and Negative Binomial models do not provide a good fit.
The ZIP model assumes that the count data follow a Poisson distribution but includes an additional component to account for the excess zeros. The model consists of two parts: a binary component that models the occurrence of zeros and a Poisson component that models the non-zero counts. The binary component can be modeled using a logistic regression model, and the Poisson component can be modeled using a Poisson regression model.
Example: A researcher wants to model the number of medical appointments per month for patients with a chronic illness. The data show an excess of patients who did not schedule any appointments in a given month.
The ZTP model assumes that the count data follow a Poisson distribution but excludes the zeros from the model. The ZTP model is used when the zeros in the data are not excess zeros, but are instead the result of a truncated distribution. The ZTP model can be fitted using a Poisson regression model on the non-zero counts only.
Example: A researcher wants to model the number of fish caught per fishing trip. The data show that some fishing trips resulted in zero fish caught due to weather conditions or other factors, but these zeros are not excess zeros.
The hurdle Poisson model is a two-part model that is similar to the ZIP model but treats all zeros as coming from the hurdle part. It consists of a binary component that models the occurrence of zeros and a truncated Poisson component that models the non-zero counts.
Example: A researcher wants to model the number of car accidents per day on a busy intersection. The data show an excess of days with no accidents and a right-skewed distribution of the non-zero counts.
The ZINB model is a generalization of the ZIP model that allows for overdispersion in the count data. The ZINB model consists of two parts: a binary component that models the occurrence of zeros and a negative binomial component that models the non-zero counts. The binary component can be modeled using a logistic regression model, and the negative binomial component can be modeled using a negative binomial regression model.
Example: A researcher wants to model the number of accidents per month in a city. The data show an excess of months with no accidents and overdispersion in the non-zero counts.
The ZTNB model is a generalization of the ZTP model that allows for overdispersion in the count data. The ZTNB model assumes that the count data follow a negative binomial distribution but excludes the zeros from the model. The ZTNB model can be fitted using a negative binomial regression model on the non-zero counts only.
Example: A researcher wants to model the number of books read per month by a book club. The data show that members who did not read any books in a given month did not attend the book club meeting.
The hurdle negative binomial model is a two-part model that is similar to the ZINB model but treats all zeros as coming from the hurdle part. It consists of a binary component that models the occurrence of zeros and a truncated negative binomial component that models the non-zero counts.
Example: A researcher wants to model the number of visits per month to a website. The data show an excess of visitors who did not visit the website in a given month and a right-skewed distribution of the non-zero counts.
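Before choosing among these models, it is often helpful to check how many zeros a plain Poisson fit would predict. Here is a minimal sketch, assuming the data frame data and the Poisson fit poisson_model created in the worked example below:
# Observed zeros versus zeros expected under the fitted Poisson model;
# a large excess of observed zeros points towards ZIP or hurdle models
obs_zero <- sum(data$numberofdeath == 0)
exp_zero <- sum(dpois(0, fitted(poisson_model)))
c(observed = obs_zero, expected = round(exp_zero))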
As a worked example, consider a researcher who wants to model “Risk Factors of Under-Five Mortality in Togdheer: An Application of Count Regression Models”, using evidence from the 2020 SLDHS data.
Under-five mortality refers to the death of children under the age of five years. Despite the significant progress that has been made in reducing under-five mortality globally, the burden of child mortality remains high in many developing countries, including Togdheer. Togdheer is one of the regions in Somaliland that has been severely affected by conflict, drought, and poverty, leading to poor health outcomes, including high under-five mortality rates.
Understanding the risk factors associated with under-five mortality in Togdheer is crucial for developing effective strategies to reduce child mortality. The 2020 Somaliland Demographic and Health Survey (SLDHS) provides a unique opportunity to investigate the risk factors associated with under-five mortality in Togdheer.
This study aims to investigate the risk factors of under-five mortality in Togdheer, using count regression models. The study will utilize the SLDHS 2020 data to identify the factors associated with under-five mortality in Togdheer and estimate the magnitude of their effects.
The findings of this study will contribute to the existing literature on child mortality and provide important insights for policymakers and stakeholders in Togdheer to develop and implement effective interventions to reduce under-five mortality. Additionally, the study will provide evidence-based recommendations for improving child health outcomes in Togdheer and other similar settings.
This study utilized a cross-sectional study design and analyzed data from the 2020 Somaliland Demographic and Health Survey (SLDHS). The SLDHS is a nationally representative survey that provides data on health indicators, including under-five mortality, maternal and child health, and other relevant demographic and socioeconomic variables.
The study population consisted of children under the age of five years who were born to women aged 15-49 years and were living in Togdheer region at the time of the survey.
The SLDHS used a stratified two-stage cluster sampling design to select a representative sample of households. In the first stage, clusters (enumeration areas) were selected using probability proportional to size sampling. In the second stage, households were selected using systematic sampling. Trained interviewers collected data using standardized questionnaires, including the Household Questionnaire, Women’s Questionnaire, and Under-Five Questionnaire.
The outcome variable for this study was under-five mortality, measured as the number of deaths among children under the age of five years born to each woman. The main independent variables were maternal and child demographic and socioeconomic characteristics, including maternal age, education, occupation, household wealth index, and access to health care services, among others.
Descriptive statistics were used to summarize the characteristics of the study population and the prevalence of under-five mortality. Count regression models, including Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, hurdle Poisson, and hurdle negative binomial regression models, were used to identify the risk factors associated with under-five mortality and estimate the magnitude of their effects.
The SLDHS was approved by the Somali Ministry of Health and implemented by the National Statistics Directorate. Informed consent was obtained from all participants, and all data were kept confidential and anonymous. This study was approved by the Ethics Review Committee of the University of Togdheer.
# Load library
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(MASS)
library(actuar)
## Warning: package 'actuar' was built under R version 4.1.3
##
## Attaching package: 'actuar'
## The following objects are masked from 'package:stats':
##
## sd, var
## The following object is masked from 'package:grDevices':
##
## cm
# Load data
library(haven)
## Warning: package 'haven' was built under R version 4.1.3
data <- read_dta("Togdheer2.dta")
data<-data[data$region==3,]
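# Convert the categorical covariates to factors so the models treat them as dummies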
data$childsex<-factor(data$childsex)
data$childtwin<-factor(data$childtwin)
data$Residence<-factor(data$Residence)
data$breastfeeding<-factor(data$breastfeeding)
data$Education<-factor(data$Education)
data$toiletfacility<-factor(data$toiletfacility)
data$watersource<-factor(data$watersource)
data$wealthindex<-factor(data$wealthindex)
data$maritalstatus<-factor(data$maritalstatus)
# Fit Poisson model
poisson_model<-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data, family = poisson())
# Summarize results
summary(poisson_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = poisson(), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.94809 -0.99148 0.02731 0.47494 1.55367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 2.365e-01 -9.885 < 2e-16 ***
## childsex2 -9.819e-02 6.186e-02 -1.587 0.112465
## childtwin2 7.330e-03 1.915e-01 0.038 0.969471
## childreneverborn 1.030e-01 1.316e-02 7.822 5.19e-15 ***
## watersource2 2.566e-01 7.694e-02 3.335 0.000852 ***
## toiletfacility2 -3.069e-01 1.008e-01 -3.046 0.002322 **
## wealthindex2 6.405e-01 1.471e-01 4.354 1.34e-05 ***
## wealthindex3 4.012e-01 1.450e-01 2.767 0.005664 **
## wealthindex4 5.634e-01 1.377e-01 4.091 4.30e-05 ***
## wealthindex5 5.010e-01 1.405e-01 3.567 0.000362 ***
## maritalstatus2 -1.377e-01 1.133e-01 -1.215 0.224385
## maritalstatus3 -3.958e-01 1.487e-01 -2.662 0.007758 **
## birthorder -8.997e-06 1.175e-02 -0.001 0.999389
## breastfeeding2 6.093e-01 9.928e-02 6.137 8.43e-10 ***
## mothersageatfirstbirth 3.417e-02 7.752e-03 4.408 1.04e-05 ***
## Residence2 -1.018e-01 1.041e-01 -0.978 0.328024
## Residence3 -1.163e-02 9.838e-02 -0.118 0.905877
## Education2 1.766e-02 1.139e-01 0.155 0.876753
## Education3 -2.737e-01 2.346e-01 -1.166 0.243415
## Education4 3.973e-01 4.652e-01 0.854 0.392997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1054.56 on 1004 degrees of freedom
## Residual deviance: 778.08 on 985 degrees of freedom
## AIC: 2417.4
##
## Number of Fisher Scoring iterations: 5
# Fit Negative Binomial model
nb_model <- glm.nb(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Summarize results
summary(nb_model)
##
## Call:
## glm.nb(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, data = data, init.theta = 24126.54212, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9481 -0.9915 0.0273 0.4749 1.5536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 2.365e-01 -9.884 < 2e-16 ***
## childsex2 -9.819e-02 6.187e-02 -1.587 0.112469
## childtwin2 7.332e-03 1.915e-01 0.038 0.969462
## childreneverborn 1.030e-01 1.316e-02 7.822 5.21e-15 ***
## watersource2 2.566e-01 7.694e-02 3.335 0.000853 ***
## toiletfacility2 -3.069e-01 1.008e-01 -3.046 0.002322 **
## wealthindex2 6.405e-01 1.471e-01 4.354 1.34e-05 ***
## wealthindex3 4.013e-01 1.450e-01 2.767 0.005663 **
## wealthindex4 5.635e-01 1.377e-01 4.091 4.30e-05 ***
## wealthindex5 5.010e-01 1.405e-01 3.567 0.000362 ***
## maritalstatus2 -1.377e-01 1.133e-01 -1.215 0.224399
## maritalstatus3 -3.958e-01 1.487e-01 -2.662 0.007760 **
## birthorder -8.966e-06 1.175e-02 -0.001 0.999391
## breastfeeding2 6.093e-01 9.929e-02 6.136 8.44e-10 ***
## mothersageatfirstbirth 3.417e-02 7.752e-03 4.408 1.04e-05 ***
## Residence2 -1.018e-01 1.041e-01 -0.978 0.328041
## Residence3 -1.164e-02 9.838e-02 -0.118 0.905821
## Education2 1.766e-02 1.139e-01 0.155 0.876721
## Education3 -2.737e-01 2.346e-01 -1.166 0.243415
## Education4 3.973e-01 4.652e-01 0.854 0.393011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(24126.54) family taken to be 1)
##
## Null deviance: 1054.52 on 1004 degrees of freedom
## Residual deviance: 778.06 on 985 degrees of freedom
## AIC: 2419.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 24127
## Std. Err.: 96258
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -2377.401
# Fit a Geometric regression model (a negative binomial with theta fixed at 1
# reduces to the geometric distribution)
geom_model<-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, family = negative.binomial(theta = 1), data = data)
# Summarize results
summary(geom_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = negative.binomial(theta = 1), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.45923 -0.88283 0.00541 0.32887 1.14845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.2621302 0.1930537 -11.718 < 2e-16 ***
## childsex2 -0.1234476 0.0524175 -2.355 0.018714 *
## childtwin2 0.0386582 0.1773064 0.218 0.827451
## childreneverborn 0.0984846 0.0113034 8.713 < 2e-16 ***
## watersource2 0.2846673 0.0624799 4.556 5.86e-06 ***
## toiletfacility2 -0.3165782 0.0839291 -3.772 0.000172 ***
## wealthindex2 0.6530565 0.1168432 5.589 2.95e-08 ***
## wealthindex3 0.5661597 0.1148729 4.929 9.71e-07 ***
## wealthindex4 0.6495279 0.1066808 6.089 1.63e-09 ***
## wealthindex5 0.6007061 0.1093554 5.493 5.03e-08 ***
## maritalstatus2 -0.1328160 0.0945902 -1.404 0.160598
## maritalstatus3 -0.3914590 0.1175064 -3.331 0.000896 ***
## birthorder 0.0003474 0.0105011 0.033 0.973616
## breastfeeding2 0.6181614 0.0748208 8.262 4.59e-16 ***
## mothersageatfirstbirth 0.0293571 0.0066517 4.413 1.13e-05 ***
## Residence2 -0.0888183 0.0876459 -1.013 0.311129
## Residence3 -0.0779193 0.0823983 -0.946 0.344563
## Education2 0.0535216 0.0980970 0.546 0.585465
## Education3 -0.3428470 0.1853727 -1.850 0.064685 .
## Education4 0.3853174 0.3781570 1.019 0.308484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3270752)
##
## Null deviance: 606.06 on 1004 degrees of freedom
## Residual deviance: 476.96 on 985 degrees of freedom
## AIC: 2805.2
##
## Number of Fisher Scoring iterations: 6
# Fit Quasi-Poisson model
qp_model <- glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data, family = quasipoisson())
# Summarize results
summary(qp_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = quasipoisson(), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.94809 -0.99148 0.02731 0.47494 1.55367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.338e+00 1.870e-01 -12.500 < 2e-16 ***
## childsex2 -9.819e-02 4.892e-02 -2.007 0.045003 *
## childtwin2 7.330e-03 1.514e-01 0.048 0.961408
## childreneverborn 1.030e-01 1.041e-02 9.892 < 2e-16 ***
## watersource2 2.566e-01 6.084e-02 4.218 2.70e-05 ***
## toiletfacility2 -3.069e-01 7.969e-02 -3.852 0.000125 ***
## wealthindex2 6.405e-01 1.163e-01 5.506 4.69e-08 ***
## wealthindex3 4.012e-01 1.147e-01 3.499 0.000488 ***
## wealthindex4 5.634e-01 1.089e-01 5.173 2.79e-07 ***
## wealthindex5 5.010e-01 1.111e-01 4.510 7.25e-06 ***
## maritalstatus2 -1.377e-01 8.963e-02 -1.536 0.124756
## maritalstatus3 -3.958e-01 1.175e-01 -3.367 0.000790 ***
## birthorder -8.997e-06 9.291e-03 -0.001 0.999228
## breastfeeding2 6.093e-01 7.851e-02 7.760 2.11e-14 ***
## mothersageatfirstbirth 3.417e-02 6.130e-03 5.575 3.20e-08 ***
## Residence2 -1.018e-01 8.233e-02 -1.237 0.216417
## Residence3 -1.163e-02 7.779e-02 -0.150 0.881168
## Education2 1.766e-02 9.003e-02 0.196 0.844554
## Education3 -2.737e-01 1.855e-01 -1.475 0.140494
## Education4 3.973e-01 3.678e-01 1.080 0.280312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.6253103)
##
## Null deviance: 1054.56 on 1004 degrees of freedom
## Residual deviance: 778.08 on 985 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
# Fit ZIP model
zip_model <-zeroinfl(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|1, data = data, dist = "poisson")
# Summarize results
summary(zip_model)
##
## Call:
## zeroinfl(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## 1, data = data, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.37754 -0.70104 0.02747 0.52146 1.91064
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 2.365e-01 -9.886 < 2e-16 ***
## childsex2 -9.820e-02 6.186e-02 -1.587 0.112426
## childtwin2 7.261e-03 1.915e-01 0.038 0.969758
## childreneverborn 1.030e-01 1.316e-02 7.823 5.17e-15 ***
## watersource2 2.566e-01 7.694e-02 3.335 0.000852 ***
## toiletfacility2 -3.069e-01 1.008e-01 -3.045 0.002325 **
## wealthindex2 6.406e-01 1.471e-01 4.354 1.34e-05 ***
## wealthindex3 4.013e-01 1.450e-01 2.767 0.005652 **
## wealthindex4 5.635e-01 1.377e-01 4.091 4.30e-05 ***
## wealthindex5 5.011e-01 1.405e-01 3.567 0.000361 ***
## maritalstatus2 -1.377e-01 1.133e-01 -1.215 0.224487
## maritalstatus3 -3.957e-01 1.486e-01 -2.662 0.007769 **
## birthorder -1.015e-05 1.175e-02 -0.001 0.999311
## breastfeeding2 6.093e-01 9.929e-02 6.137 8.42e-10 ***
## mothersageatfirstbirth 3.418e-02 7.751e-03 4.409 1.04e-05 ***
## Residence2 -1.018e-01 1.041e-01 -0.978 0.328076
## Residence3 -1.166e-02 9.838e-02 -0.119 0.905636
## Education2 1.771e-02 1.139e-01 0.156 0.876423
## Education3 -2.736e-01 2.346e-01 -1.166 0.243486
## Education4 3.980e-01 4.650e-01 0.856 0.392111
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.85 108.51 -0.137 0.891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 45
## Log-likelihood: -1189 on 21 Df
# Fit ZTP model (approximated here by a Poisson fit to the positive counts;
# a likelihood-correct truncated fit is sketched after the output)
ztp_model <-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = subset(data, numberofdeath > 0), family = poisson())
# Summarize results
summary(ztp_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = poisson(), data = subset(data, numberofdeath >
## 0))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.95934 -0.35266 -0.09722 0.25453 1.24787
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8124986 0.2345634 -3.464 0.000532 ***
## childsex2 0.0079055 0.0626173 0.126 0.899533
## childtwin2 -0.0762963 0.1921469 -0.397 0.691314
## childreneverborn 0.0648403 0.0129062 5.024 5.06e-07 ***
## watersource2 0.0457141 0.0801554 0.570 0.568462
## toiletfacility2 -0.3505887 0.1004278 -3.491 0.000481 ***
## wealthindex2 0.4110245 0.1473896 2.789 0.005292 **
## wealthindex3 0.1046121 0.1453998 0.719 0.471846
## wealthindex4 0.1605622 0.1363616 1.177 0.239007
## wealthindex5 0.0516197 0.1422557 0.363 0.716705
## maritalstatus2 0.0703858 0.1164930 0.604 0.545707
## maritalstatus3 0.0135826 0.1521566 0.089 0.928870
## birthorder -0.0005783 0.0118074 -0.049 0.960939
## breastfeeding2 0.1732129 0.1015624 1.705 0.088105 .
## mothersageatfirstbirth 0.0290687 0.0076072 3.821 0.000133 ***
## Residence2 -0.2246540 0.1104497 -2.034 0.041952 *
## Residence3 0.1017992 0.1011931 1.006 0.314421
## Education2 0.1318192 0.1134805 1.162 0.245397
## Education3 0.1691714 0.2397268 0.706 0.480385
## Education4 0.0123514 0.4659500 0.027 0.978852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 251.07 on 693 degrees of freedom
## Residual deviance: 136.12 on 674 degrees of freedom
## AIC: 1775.4
##
## Number of Fisher Scoring iterations: 4
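Note that fitting a plain Poisson model to the positive counts ignores the truncation at zero. A minimal sketch of a likelihood-correct zero-truncated fit, assuming the VGAM package, is given below; pospoisson() handles the truncated Poisson case, and posnegbinomial() plays the same role for the ZTNB model further on. A reduced covariate set is used purely for illustration.
# Zero-truncated Poisson fitted by maximum likelihood with VGAM
library(VGAM)
ztp_vglm <- vglm(numberofdeath ~ childsex + childreneverborn + breastfeeding,
                 family = pospoisson(), data = subset(data, numberofdeath > 0))
summary(ztp_vglm)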
# Fit hurdle poisson model
hurdlep_model<-hurdle(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "poisson")
# Summarize results
summary(hurdlep_model)
##
## Call:
## hurdle(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## Residence, data = data, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.4669 -1.1000 0.1561 0.4921 2.8689
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.358e+00 4.528e-01 -7.416 1.21e-13 ***
## childsex2 5.437e-02 9.745e-02 0.558 0.57690
## childtwin2 -1.839e-01 2.788e-01 -0.660 0.50945
## childreneverborn 1.583e-01 2.176e-02 7.275 3.47e-13 ***
## watersource2 4.877e-02 1.537e-01 0.317 0.75096
## toiletfacility2 -9.117e-01 1.917e-01 -4.755 1.99e-06 ***
## wealthindex2 8.911e-01 2.751e-01 3.240 0.00120 **
## wealthindex3 1.108e-01 2.720e-01 0.407 0.68373
## wealthindex4 5.008e-01 2.623e-01 1.909 0.05624 .
## wealthindex5 6.011e-02 2.768e-01 0.217 0.82806
## maritalstatus2 2.917e-01 1.935e-01 1.507 0.13172
## maritalstatus3 1.285e-01 2.534e-01 0.507 0.61206
## birthorder -3.151e-03 1.649e-02 -0.191 0.84847
## breastfeeding2 7.773e-01 2.373e-01 3.276 0.00105 **
## mothersageatfirstbirth 5.642e-02 1.274e-02 4.429 9.46e-06 ***
## Residence2 -5.274e-01 1.973e-01 -2.673 0.00753 **
## Residence3 4.022e-01 1.984e-01 2.027 0.04264 *
## Education2 1.120e-01 1.851e-01 0.605 0.54503
## Education3 5.508e-01 3.641e-01 1.513 0.13040
## Education4 -1.382e+01 1.182e+03 -0.012 0.99067
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80922 0.14911 5.427 5.73e-08 ***
## Residence2 0.07808 0.20142 0.388 0.698
## Residence3 -0.05024 0.17584 -0.286 0.775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 43
## Log-likelihood: -1185 on 23 Df
# Fit ZINB model
zinb_model <-zeroinfl(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "negbin")
## Warning in value[[3L]](cond): system is computationally singular: reciprocal
## condition number = 5.40204e-37FALSE
# Summarize results
summary(zinb_model)
##
## Call:
## zeroinfl(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## Residence, data = data, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.37751 -0.70108 0.02744 0.52144 1.91058
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 NA NA NA
## childsex2 -9.819e-02 NA NA NA
## childtwin2 7.334e-03 NA NA NA
## childreneverborn 1.030e-01 NA NA NA
## watersource2 2.566e-01 NA NA NA
## toiletfacility2 -3.069e-01 NA NA NA
## wealthindex2 6.405e-01 NA NA NA
## wealthindex3 4.012e-01 NA NA NA
## wealthindex4 5.634e-01 NA NA NA
## wealthindex5 5.010e-01 NA NA NA
## maritalstatus2 -1.377e-01 NA NA NA
## maritalstatus3 -3.958e-01 NA NA NA
## birthorder -9.030e-06 NA NA NA
## breastfeeding2 6.093e-01 NA NA NA
## mothersageatfirstbirth 3.417e-02 NA NA NA
## Residence2 -1.018e-01 NA NA NA
## Residence3 -1.163e-02 NA NA NA
## Education2 1.766e-02 NA NA NA
## Education3 -2.737e-01 NA NA NA
## Education4 3.973e-01 NA NA NA
## Log(theta) 1.483e+01 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -55.515 NA NA NA
## Residence2 -21.636 NA NA NA
## Residence3 -5.911 NA NA NA
##
## Theta = 2745497.5196
## Number of iterations in BFGS optimization: 50
## Log-likelihood: -1189 on 24 Df
# Fit ZTNB model (approximated here by a negative binomial fit to the positive
# counts; see the VGAM note above for a likelihood-correct alternative)
ztnb_model <-glm.nb(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = subset(data, numberofdeath > 0))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Summarize results
summary(ztnb_model)
##
## Call:
## glm.nb(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, data = subset(data, numberofdeath > 0), init.theta = 130858.5048,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.95933 -0.35266 -0.09722 0.25453 1.24785
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8124955 0.2345652 -3.464 0.000533 ***
## childsex2 0.0079056 0.0626178 0.126 0.899533
## childtwin2 -0.0762969 0.1921485 -0.397 0.691314
## childreneverborn 0.0648402 0.0129063 5.024 5.06e-07 ***
## watersource2 0.0457144 0.0801560 0.570 0.568463
## toiletfacility2 -0.3505888 0.1004286 -3.491 0.000481 ***
## wealthindex2 0.4110241 0.1473907 2.789 0.005292 **
## wealthindex3 0.1046129 0.1454008 0.719 0.471845
## wealthindex4 0.1605624 0.1363625 1.177 0.239009
## wealthindex5 0.0516200 0.1422566 0.363 0.716705
## maritalstatus2 0.0703858 0.1164938 0.604 0.545709
## maritalstatus3 0.0135823 0.1521576 0.089 0.928871
## birthorder -0.0005783 0.0118075 -0.049 0.960939
## breastfeeding2 0.1732129 0.1015631 1.705 0.088107 .
## mothersageatfirstbirth 0.0290686 0.0076073 3.821 0.000133 ***
## Residence2 -0.2246540 0.1104506 -2.034 0.041954 *
## Residence3 0.1017986 0.1011939 1.006 0.314427
## Education2 0.1318193 0.1134814 1.162 0.245401
## Education3 0.1691715 0.2397285 0.706 0.480387
## Education4 0.0123513 0.4659520 0.027 0.978852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(130858.5) family taken to be 1)
##
## Null deviance: 251.07 on 693 degrees of freedom
## Residual deviance: 136.12 on 674 degrees of freedom
## AIC: 1777.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 130859
## Std. Err.: 887717
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -1735.423
# Fit hurdle negative binomial model
hurdlenb_model<-hurdle(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "negbin")
## Warning in sqrt(diag(vc_count)[kx + 1]): NaNs produced
# Summarize results
summary(hurdlenb_model)
##
## Call:
## hurdle(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## Residence, data = data, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.4669 -1.1000 0.1561 0.4922 2.8687
##
## Count model coefficients (truncated negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.35809 0.45282 -7.416 1.21e-13 ***
## childsex2 0.05446 0.09745 0.559 0.57629
## childtwin2 -0.18384 0.27875 -0.660 0.50957
## childreneverborn 0.15828 0.02176 7.274 3.48e-13 ***
## watersource2 0.04881 0.15368 0.318 0.75078
## toiletfacility2 -0.91178 0.19173 -4.755 1.98e-06 ***
## wealthindex2 0.89107 0.27505 3.240 0.00120 **
## wealthindex3 0.11082 0.27204 0.407 0.68373
## wealthindex4 0.50077 0.26230 1.909 0.05624 .
## wealthindex5 0.06013 0.27677 0.217 0.82801
## maritalstatus2 0.29167 0.19351 1.507 0.13175
## maritalstatus3 0.12853 0.25340 0.507 0.61198
## birthorder -0.00315 0.01649 -0.191 0.84854
## breastfeeding2 0.77731 0.23729 3.276 0.00105 **
## mothersageatfirstbirth 0.05641 0.01274 4.429 9.46e-06 ***
## Residence2 -0.52740 0.19732 -2.673 0.00752 **
## Residence3 0.40220 0.19841 2.027 0.04265 *
## Education2 0.11206 0.18509 0.605 0.54487
## Education3 0.55082 0.36414 1.513 0.13037
## Education4 -12.73724 1.26628 -10.059 < 2e-16 ***
## Log(theta) 14.07542 NaN NaN NaN
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80922 0.14911 5.427 5.73e-08 ***
## Residence2 0.07808 0.20142 0.388 0.698
## Residence3 -0.05024 0.17584 -0.286 0.775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta: count = 1296814.9889
## Number of iterations in BFGS optimization: 44
## Log-likelihood: -1185 on 24 Df
In summary, Poisson, Negative Binomial, Geometric, Quasi-Poisson, ZIP, ZINB, ZTP, ZTNB, hurdle Poisson, and hurdle negative binomial regression models are all useful for analyzing count data, but they differ in their assumptions and approach. The choice of model depends on the nature of the data and the specific research question being addressed.
# AIC
aic<-AIC(poisson_model,nb_model,zip_model,zinb_model,hurdlep_model,hurdlenb_model)
bic<-BIC(poisson_model,nb_model,zip_model,zinb_model,hurdlep_model,hurdlenb_model)
aic
## df AIC
## poisson_model 20 2417.383
## nb_model 21 2419.401
## zip_model 21 2419.383
## zinb_model 24 2425.383
## hurdlep_model 23 2415.942
## hurdlenb_model 24 2417.942
bic
## df BIC
## poisson_model 20 2515.637
## nb_model 21 2522.569
## zip_model 21 2522.550
## zinb_model 24 2543.289
## hurdlep_model 23 2528.935
## hurdlenb_model 24 2535.848
The results presented show the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values for the count regression models fitted to the data. Both criteria measure the goodness of fit of a statistical model, penalized for complexity, and lower values indicate a better-fitting model.
In this study, six count regression models were fitted to the data: Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), hurdle Poisson, and hurdle negative binomial regression models.
The results show that the hurdle Poisson regression model had the lowest AIC value (2415.942), while the ordinary Poisson model had the lowest BIC value (2515.637), reflecting BIC's heavier penalty on the hurdle model's extra parameters. The hurdle Poisson model is a two-part model that accounts for excess zeros while assuming equidispersion in the positive counts, making it suitable for count data with these features.
By AIC, the Poisson model (2417.383) and the hurdle negative binomial model (2417.942) followed closely, then the ZIP and NB models; the ZINB model had the highest AIC and BIC among the models considered.
Overall, the results suggest that the hurdle Poisson model provides the best fit by AIC and is used below to identify the risk factors associated with under-five mortality in Togdheer, although the simpler Poisson model is competitive, particularly by BIC.
The results presented are from the hurdle Poisson regression model, a two-part model that accounts for excess zeros while assuming equidispersion in the positive counts. The model was used to investigate the risk factors associated with under-five mortality in Togdheer, using a set of independent variables.
The results are presented in two parts. The first part shows the count model coefficients, which represent the association between the independent variables and the count of deaths among children under the age of five years. The intercept term represents the log of the expected count when all independent variables are zero or at their reference categories.
The results show that childreneverborn, toiletfacility, wealthindex (its second category), breastfeeding, mothersageatfirstbirth, and Residence were significantly associated with the count of deaths among children under the age of five years. The coefficients for the toilet facility indicator and the second residence category were negative, indicating lower expected death counts, while children ever born, the breastfeeding indicator, and mother's age at first birth had positive coefficients, indicating higher expected counts.
Other variables, such as childsex, childtwin, watersource, birthorder, and Education, were not significantly associated with the count of under-five mortality in this study.
The second part of the results shows the zero hurdle model coefficients, which describe the probability of crossing the hurdle, that is, of observing at least one death among children under the age of five years. The results show that only the intercept term was significant; neither residence category contributed to the hurdle part.
The model's log-likelihood was -1185 on 23 degrees of freedom, the best among the fitted models by AIC. The Pearson residuals show no extreme values, ranging from about -1.47 to 2.87.
Overall, the results suggest that factors related to maternal and child health, such as childreneverborn, toiletfacility, Residence, breastfeeding, and mothersageatfirstbirth, are important predictors of under-five mortality in Togdheer. Policymakers and stakeholders should focus on implementing interventions to improve maternal and child health outcomes to reduce under-five mortality in the region.
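Because the count part uses a log link, the coefficients are easier to communicate as incidence rate ratios. Here is a minimal sketch, assuming the hurdlep_model object fitted above:
# Exponentiate the count-part coefficients of the hurdle Poisson model
# to obtain incidence rate ratios with approximate 95% confidence intervals
cf <- summary(hurdlep_model)$coefficients$count
round(exp(cbind(IRR = cf[, "Estimate"],
                lower = cf[, "Estimate"] - 1.96 * cf[, "Std. Error"],
                upper = cf[, "Estimate"] + 1.96 * cf[, "Std. Error"])), 3)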
For more information about drafting research papers using count regression models, we refer readers to (Argawu et al., 2022).
Negative binomial regression: Negative binomial regression becomes an alternative to Poisson regression when there is overdispersion in the data, which means that the variance is greater than the mean.
Quasi-Poisson regression: Quasi-Poisson regression becomes an alternative to Poisson regression when there is overdispersion in the data, but the negative binomial distribution is not appropriate, or when there is underdispersion in the data.
Geometric regression: Geometric regression becomes an alternative to Poisson regression when the dependent variable follows a geometric distribution.
Zero-inflated Poisson regression: Zero-inflated Poisson regression becomes an alternative to Poisson regression when the data contains excess zeros, which means that there are more zeros in the data than would be expected from a Poisson distribution.
Zero-inflated negative binomial regression: Zero-inflated negative binomial regression becomes an alternative to Poisson regression when the data contains excess zeros and overdispersion.
Zero-truncated Poisson regression: Zero-truncated Poisson regression becomes an alternative to Poisson regression when the data contain only non-zero counts (zeros are structurally impossible) and the positive counts are equidispersed.
Zero-truncated negative binomial regression: Zero-truncated negative binomial regression becomes an alternative to Poisson regression when the data contains non-zero counts, and the variance is greater than the mean.
Hurdle Poisson regression: Hurdle Poisson regression becomes an alternative to Poisson regression when the data contain excess zeros and the positive counts show no overdispersion.
Hurdle negative binomial regression: Hurdle negative binomial regression becomes an alternative to Poisson regression when the data contains excess zeros, and overdispersion.
Overall, these different regression models become alternatives to Poisson regression in various situations where the assumptions of Poisson regression are not met or where the data has certain characteristics such as excess zeros, overdispersion, or only non-zero counts. Understanding the appropriate regression model to use for a given data set is important for obtaining accurate and reliable predictions and insights.
Regression models with an ordinal response variable are commonly used in situations where the outcome of interest is an ordered categorical variable with three or more levels. In this type of model, the outcome variable is modeled as a function of one or more predictor variables. Here are five regression models commonly used with an ordinal response variable, along with their definition, application, assumptions, real-life examples, and R code:
Definition: Ordinal logistic regression is a statistical model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the odds of being in a higher category of the outcome variable are modeled as a function of the predictor variables.
Application: Ordinal logistic regression can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.
Assumptions: The key assumption of ordinal logistic regression is the proportional odds assumption, which states that the effect of each predictor on the odds of being in a higher category of the outcome variable is the same across all category thresholds.
Real-life example: A researcher wants to investigate the relationship between the level of education and the severity of depression in a sample of patients. The outcome variable is the severity of depression, which ranges from mild to severe, and the predictor variable is the level of education, which is categorized as high school, college, or graduate school.
Definition: Ordered probit regression is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the probability of being in each category of the outcome variable is modeled as a function of the predictor variables using a probit link function.
Application: Ordered probit regression can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.
Assumptions: The assumptions of ordered probit regression include the normality of the error term and the independence of the error terms across categories of the outcome variable.
Real-life example: A researcher wants to investigate the relationship between the level of physical activity and the level of depression in a sample of adults. The outcome variable is the level of depression, which is categorized as mild, moderate, or severe, and the predictor variable is the level of physical activity, which is categorized as low, moderate, or high.
Definition: The cumulative logit model is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the cumulative probabilities of being in each category of the outcome variable are modeled as a function of the predictor variables.
Application: Cumulative logit models can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.
Assumptions: The assumptions of the cumulative logit model include the proportional odds assumption, which states that the effect of each predictor on the cumulative odds is the same across all category thresholds of the outcome variable.
Real-life example: A researcher wants to investigate the relationship between the level of education and the level of income in a sample of workers. The outcome variable is the level of income, which is categorized as low, medium, or high, and the predictor variable is the level of education, which is categorized as high school, college, or graduate school.
Definition: The proportional odds model is a type of ordinal logistic regression model in which the probability of the response variable taking a particular value or lower is modeled as a function of the predictor variables. It is used when the response variable is ordinal, with categories that have a natural ordering.
Application: The proportional odds model can be applied to a wide range of research areas, including social sciences, medical research, and marketing. It can be used to investigate the relationship between predictor variables and a range of ordinal outcomes, such as Likert scale responses, disease stages, or educational levels.
Assumptions: The proportional odds model assumes that the odds ratios for each predictor variable are constant across all levels of the response variable. It also assumes that the relationship between the predictor variables and the response variable is linear on the log-odds scale.
Real-life example: Suppose a researcher wants to investigate the relationship between age, gender, and income on the level of education attained. The response variable is ordered, with categories of “less than high school,” “high school,” “some college,” and “college degree or higher.” The proportional odds model can be used to model the probability of attaining a certain level of education based on age, gender, and income.
Definition: The partial proportional odds model is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the proportional odds assumption is relaxed for one or more of the predictor variables.
Application: Partial proportional odds models can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable when the proportional odds assumption is violated.
Assumptions: The assumptions of the partial proportional odds model include the assumption that the proportional odds assumption holds for at least one of the predictor variables.
Real-life example: A researcher wants to investigate the relationship between the level of social support and the level of depression in a sample of adults. The outcome variable is the level of depression, which is categorized as mild, moderate, or severe, and the predictor variable is the level of social support, which is categorized as low, moderate, or high. However, the proportional odds assumption does not hold for the level of social support variable.
R Code for All
# Load the package and data
library(MASS)
library(ordinal)
## Warning: package 'ordinal' was built under R version 4.1.3
##
## Attaching package: 'ordinal'
## The following objects are masked from 'package:actuar':
##
## dgumbel, dlgamma, pgumbel, plgamma, qgumbel, rgumbel
## The following object is masked from 'package:dplyr':
##
## slice
data(Cars93)
# Create an ordered response variable
Cars93$Price <- cut(Cars93$Price, breaks = c(0, 15, 30, 45, 60, 75, 90, Inf), labels = c("0-15", "15-30", "30-45", "45-60", "60-75", "75-90", ">90"))
Cars93$Price <- ordered(Cars93$Price)
# Proportional odds model
model_proportional_odds <- polr(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
summary(model_proportional_odds)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
##
## Coefficients:
## Value Std. Error t value
## MPG.city -0.7621 0.2465 -3.0910
## MPG.highway 0.2014 0.1670 1.2061
## EngineSize 0.2833 0.3743 0.7568
##
## Intercepts:
## Value Std. Error t value
## 0-15|15-30 -11.1564 3.4265 -3.2559
## 15-30|30-45 -6.5208 3.1823 -2.0491
## 30-45|45-60 -4.2840 3.2662 -1.3116
## 45-60|60-75 -3.5746 3.3428 -1.0694
##
## Residual Deviance: 119.4952
## AIC: 133.4952
# Note: polr() cannot relax the proportional odds assumption, and method = "logistic"
# is its default, so the call below simply reproduces the proportional odds model
# above; a genuine partial proportional odds fit is sketched after this output
model_partial_proportional_odds <- polr(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, method = "logistic")
summary(model_partial_proportional_odds)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93,
## method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## MPG.city -0.7621 0.2465 -3.0910
## MPG.highway 0.2014 0.1670 1.2061
## EngineSize 0.2833 0.3743 0.7568
##
## Intercepts:
## Value Std. Error t value
## 0-15|15-30 -11.1564 3.4265 -3.2559
## 15-30|30-45 -6.5208 3.1823 -2.0491
## 30-45|45-60 -4.2840 3.2662 -1.3116
## 45-60|60-75 -3.5746 3.3428 -1.0694
##
## Residual Deviance: 119.4952
## AIC: 133.4952
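Because polr() cannot relax the parallel-lines assumption, the "partial" fit above duplicates the proportional odds fit, as the matching coefficients and AIC confirm. A genuine partial proportional odds model can instead be fit with clm() from the ordinal package, whose nominal argument gives the listed predictors category-specific effects. The sketch below reuses the Cars93 recoding above and treats EngineSize as the non-proportional term purely for illustration (output not shown):
# Screen each predictor for violations of the proportional odds assumption
model_clm <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
nominal_test(model_clm)
# Relax proportional odds for EngineSize only
model_ppo <- clm(Price ~ MPG.city + MPG.highway, nominal = ~ EngineSize, data = Cars93)
summary(model_ppo)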
# Ordinal logistic regression model
model_ordinal_logistic <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "logit")
summary(model_ordinal_logistic)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data: Cars93
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 93 -59.75 133.50 7(0) 3.04e-10 8.4e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MPG.city -0.7621 0.2465 -3.091 0.00199 **
## MPG.highway 0.2014 0.1670 1.206 0.22777
## EngineSize 0.2833 0.3743 0.757 0.44919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0-15|15-30 -11.157 3.427 -3.256
## 15-30|30-45 -6.521 3.182 -2.049
## 30-45|45-60 -4.284 3.266 -1.312
## 45-60|60-75 -3.575 3.343 -1.069
# Ordinal probit regression model
model_ordinal_probit <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "probit")
summary(model_ordinal_probit)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data: Cars93
##
## link threshold nobs logLik AIC niter max.grad cond.H
## probit flexible 93 -59.93 133.85 7(1) 5.82e-09 7.8e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MPG.city -0.41922 0.13153 -3.187 0.00144 **
## MPG.highway 0.09881 0.09237 1.070 0.28476
## EngineSize 0.14504 0.19981 0.726 0.46791
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0-15|15-30 -6.529 1.859 -3.512
## 15-30|30-45 -3.900 1.739 -2.243
## 30-45|45-60 -2.769 1.764 -1.570
## 45-60|60-75 -2.494 1.779 -1.402
# Cumulative logit model
model_cumulative_logit <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
summary(model_cumulative_logit)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data: Cars93
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 93 -59.75 133.50 7(0) 3.04e-10 8.4e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MPG.city -0.7621 0.2465 -3.091 0.00199 **
## MPG.highway 0.2014 0.1670 1.206 0.22777
## EngineSize 0.2833 0.3743 0.757 0.44919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0-15|15-30 -11.157 3.427 -3.256
## 15-30|30-45 -6.521 3.182 -2.049
## 30-45|45-60 -4.284 3.266 -1.312
## 45-60|60-75 -3.575 3.343 -1.069
aic<-AIC(model_ordinal_logistic,model_ordinal_probit,model_cumulative_logit,model_proportional_odds,model_partial_proportional_odds)
aic
## df AIC
## model_ordinal_logistic 7 133.4952
## model_ordinal_probit 7 133.8548
## model_cumulative_logit 7 133.4952
## model_proportional_odds 7 133.4952
## model_partial_proportional_odds 7 133.4952
All of the logit-link fits share an AIC of 133.50 because they specify the same cumulative logit model through different interfaces; only the probit fit differs, and then only slightly.
Logistic regression: This is a type of regression model used when the dependent variable has two categories (e.g., yes or no, 1 or 0). The goal of logistic regression is to predict the probability of the positive category from one or more predictor variables.
Probit regression: This model is similar to logistic regression but uses the cumulative standard normal distribution as the link function, which corresponds to assuming a normally distributed latent error term.
Complementary log-log regression: This model relates the predictors linearly to log(-log(1 - p)), where p is the probability of the positive category; the link is asymmetric and is often used when that probability is very small or very large.
Cauchit regression: This is a generalized linear model (GLM) for a binary or dichotomous outcome that uses the inverse of the Cauchy cumulative distribution function as its link; the Cauchy distribution's heavy tails make the model more tolerant of observations with extreme linear predictors.
Generalized estimating equations (GEE): This approach is used when the same binary outcome is measured repeatedly on each subject. GEE accounts for the correlation among the repeated measurements and yields consistent estimates of the model coefficients; a minimal sketch follows.
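A minimal GEE sketch using geeglm() from the geepack package on simulated repeated-measures data; the data, effect sizes, and exchangeable working correlation are all illustrative assumptions (output not shown):
library(geepack)
set.seed(1)
# Simulated long format: 50 subjects, 4 visits each, binary outcome y
dat <- data.frame(id = rep(1:50, each = 4), visit = rep(1:4, 50), x = rnorm(200))
dat$y <- rbinom(200, 1, plogis(-0.5 + 0.8 * dat$x))
# GEE logistic model with an exchangeable working correlation
gee_model <- geeglm(y ~ x + visit, family = binomial, data = dat, id = id, corstr = "exchangeable")
summary(gee_model)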
Here we analyze low birth weight data to illustrate binary regression models. The dataset is presented in Hosmer and Lemeshow (2004) and contains information on 189 births to women seen in the obstetric clinic, collected as part of a larger study at Baystate Medical Center in Springfield, Massachusetts. The response variable LOW is a binary indicator of birth weight below 2,500 grams, an outcome of long-standing clinical concern.
# Load the dataset
library(brinla)
data("lowbwt")
# Fit a logistic regression model
logitmodel <- glm(LOW ~ AGE+LWT+RACE+SMOKE+HT+UI+FTV, data =lowbwt, family = binomial(link = "logit"))
# View the model summary
summary(logitmodel)
##
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV,
## family = binomial(link = "logit"), data = lowbwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7278 -0.8291 -0.5482 0.9709 2.1962
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.45481 1.18541 0.384 0.70122
## AGE -0.02051 0.03595 -0.570 0.56844
## LWT -0.01652 0.00686 -2.409 0.01600 *
## RACE2 1.28976 0.52761 2.445 0.01451 *
## RACE3 0.91907 0.43630 2.106 0.03516 *
## SMOKE1 1.04159 0.39548 2.634 0.00844 **
## HT1 1.88506 0.69481 2.713 0.00667 **
## UI1 0.90415 0.44860 2.015 0.04385 *
## FTV 0.05912 0.17200 0.344 0.73105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 203.83 on 180 degrees of freedom
## AIC: 221.83
##
## Number of Fisher Scoring iterations: 4
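Because the logistic regression coefficients are log odds ratios, exponentiating them eases interpretation; a quick follow-up (output not shown):
# Odds ratios with Wald 95% confidence intervals
exp(cbind(OR = coef(logitmodel), confint.default(logitmodel)))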
# Fit a Probit regression model
probitmodel <- glm(LOW ~ AGE+LWT+RACE+SMOKE+HT+UI+FTV, data =lowbwt, family = binomial(link = "probit"))
# View the model summary
summary(probitmodel)
##
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV,
## family = binomial(link = "probit"), data = lowbwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7043 -0.8491 -0.5351 0.9740 2.2252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.256356 0.696796 0.368 0.71294
## AGE -0.012640 0.021179 -0.597 0.55063
## LWT -0.009645 0.003966 -2.432 0.01502 *
## RACE2 0.762234 0.314450 2.424 0.01535 *
## RACE3 0.540807 0.253656 2.132 0.03300 *
## SMOKE1 0.634580 0.230601 2.752 0.00593 **
## HT1 1.122746 0.415545 2.702 0.00690 **
## UI1 0.546000 0.272855 2.001 0.04539 *
## FTV 0.022830 0.101442 0.225 0.82193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 203.55 on 180 degrees of freedom
## AIC: 221.55
##
## Number of Fisher Scoring iterations: 5
# Fit a complementary log-log regression model
cloglogmodel <- glm(LOW ~ AGE+LWT+RACE+SMOKE+HT+UI+FTV, data =lowbwt, family = binomial(link = "cloglog"))
# View the model summary
summary(cloglogmodel)
##
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV,
## family = binomial(link = "cloglog"), data = lowbwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8386 -0.8125 -0.5644 0.9948 2.1519
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.116450 0.914946 -0.127 0.89872
## AGE -0.017513 0.028348 -0.618 0.53671
## LWT -0.012515 0.005378 -2.327 0.01997 *
## RACE2 1.102487 0.394463 2.795 0.00519 **
## RACE3 0.747427 0.338748 2.206 0.02735 *
## SMOKE1 0.817161 0.301502 2.710 0.00672 **
## HT1 1.468116 0.456304 3.217 0.00129 **
## UI1 0.687640 0.331881 2.072 0.03827 *
## FTV 0.076347 0.135185 0.565 0.57224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 204.14 on 180 degrees of freedom
## AIC: 222.14
##
## Number of Fisher Scoring iterations: 6
# Fit a Cauchit regression model
caumodel <- glm(LOW ~ AGE+LWT+RACE+SMOKE+HT+UI+FTV, data =lowbwt, family = binomial(link = "cauchit"))
# View the model summary
summary(caumodel)
##
## Call:
## glm(formula = LOW ~ AGE + LWT + RACE + SMOKE + HT + UI + FTV,
## family = binomial(link = "cauchit"), data = lowbwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8054 -0.7911 -0.5859 0.9584 2.0566
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.643511 1.266545 0.508 0.61139
## AGE -0.024189 0.038579 -0.627 0.53066
## LWT -0.018296 0.008279 -2.210 0.02712 *
## RACE2 1.376953 0.573070 2.403 0.01627 *
## RACE3 1.031063 0.498556 2.068 0.03863 *
## SMOKE1 0.951250 0.443962 2.143 0.03214 *
## HT1 2.009150 0.770681 2.607 0.00913 **
## UI1 0.904793 0.434412 2.083 0.03727 *
## FTV 0.154221 0.183646 0.840 0.40103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 205.04 on 180 degrees of freedom
## AIC: 223.04
##
## Number of Fisher Scoring iterations: 8
# Model Comparison
aic<-AIC(logitmodel,probitmodel,cloglogmodel,caumodel)
aic
## df AIC
## logitmodel 9 221.8310
## probitmodel 9 221.5472
## cloglogmodel 9 222.1358
## caumodel 9 223.0378
bic<-BIC(logitmodel,probitmodel,cloglogmodel,caumodel)
bic
## df BIC
## logitmodel 9 251.0067
## probitmodel 9 250.7229
## cloglogmodel 9 251.3115
## caumodel 9 252.2136
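Here the probit model attains the smallest AIC and BIC, although the differences across the four links are modest. Fitted probabilities of low birth weight can then be extracted from the preferred model (output not shown):
# Predicted probabilities from the probit model
head(predict(probitmodel, type = "response"))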
Multinomial logistic regression: This is a type of regression model used when the dependent variable has three or more unordered categories. The goal of multinomial logistic regression is to predict the probability of each outcome category from one or more predictor variables.
Conditional logistic regression: This variant of logistic regression is used for matched or stratified data (e.g., matched sets in case-control studies); the coefficients are estimated by conditioning on each stratum, which removes stratum-specific nuisance parameters.
Multinomial probit regression: This model is similar to multinomial logistic regression but assumes normally distributed latent errors, which also permits correlation among the alternatives.
Generalized multinomial logit model: This model extends multinomial logistic regression to allow more flexible relationships between the predictor variables and the outcome, for example by allowing scale heterogeneity across individuals.
Discriminant analysis: This classification technique predicts the category of a dependent variable from one or more predictor variables and assumes that the predictors follow a (multivariate) normal distribution within each class; a minimal sketch follows.
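A minimal linear discriminant analysis sketch, using lda() from the MASS package on the built-in iris data (the dataset choice is ours, for illustration; output not shown):
library(MASS)
# Predict species from the four flower measurements
lda_model <- lda(Species ~ ., data = iris)
lda_model
# Training-set classification accuracy
mean(predict(lda_model)$class == iris$Species)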
In the following example we fit multinomial logistic and multinomial probit models to the TravelMode data from the AER package, which record the chosen travel mode (air, train, bus, or car) of 210 travelers between Sydney and Melbourne, together with terminal waiting time, in-vehicle travel time, and vehicle cost. We use the mlogit function from the mlogit package to fit each model and then view the results with the summary function.
# Fit a multinomial logistic regression model
library(mlogit)
data("TravelMode", package = "AER")
mlmodel <- mlogit(choice ~ wait + travel + vcost, TravelMode,probit = FALSE)
# View the model summary
summary(mlmodel)
##
## Call:
## mlogit(formula = choice ~ wait + travel + vcost, data = TravelMode,
## probit = FALSE, method = "nr")
##
## Frequencies of alternatives:choice
## air train bus car
## 0.27619 0.30000 0.14286 0.28095
##
## nr method
## 5 iterations, 0h:0m:0s
## g'(-H)^-1g = 0.000192
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):train -0.78666667 0.60260733 -1.3054 0.19174
## (Intercept):bus -1.43363372 0.68071345 -2.1061 0.03520 *
## (Intercept):car -4.73985647 0.86753178 -5.4636 4.665e-08 ***
## wait -0.09688675 0.01034202 -9.3683 < 2.2e-16 ***
## travel -0.00399468 0.00084915 -4.7043 2.547e-06 ***
## vcost -0.01391160 0.00665133 -2.0916 0.03648 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -192.89
## McFadden R^2: 0.32024
## Likelihood ratio test : chisq = 181.74 (p.value = < 2.22e-16)
All three alternative attributes carry significant negative coefficients: longer waiting times, longer travel times, and higher vehicle costs each reduce the probability that a mode is chosen.
# Fit a multinomial probit regression model
mpmodel <- mlogit(choice ~ wait + travel + vcost, TravelMode,probit = TRUE)
# View the model summary
summary(mpmodel)
##
## Call:
## mlogit(formula = choice ~ wait + travel + vcost, data = TravelMode,
## probit = TRUE)
##
## Frequencies of alternatives:choice
## air train bus car
## 0.27619 0.30000 0.14286 0.28095
##
## bfgs method
## 19 iterations, 0h:0m:6s
## g'(-H)^-1g = 2.02E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):train 0.37591642 0.26845828 1.4003 0.161430
## (Intercept):bus 0.26157459 0.24400733 1.0720 0.283722
## (Intercept):car -0.70272048 0.35125181 -2.0006 0.045434 *
## wait -0.02221466 0.00542063 -4.0982 4.164e-05 ***
## travel -0.00181971 0.00057437 -3.1682 0.001534 **
## vcost -0.00702779 0.00269126 -2.6113 0.009019 **
## train.bus 0.81991555 0.20779581 3.9458 7.954e-05 ***
## train.car 0.87336208 0.19131344 4.5651 4.993e-06 ***
## bus.bus 0.32955846 0.20251005 1.6274 0.103659
## bus.car 0.26520762 0.30970850 0.8563 0.391824
## car.car 0.38950924 0.16288629 2.3913 0.016789 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -192.56
## McFadden R^2: 0.32139
## Likelihood ratio test : chisq = 182.39 (p.value = < 2.22e-16)
The additional train.bus through car.car terms are elements of the Cholesky factor of the error covariance matrix, which the multinomial probit model estimates alongside the regression coefficients.
library(nnet)
# Attempt to fit a multinomial model with a cloglog link. Note that multinom()
# has no 'family' argument: the argument is silently ignored and multinom()
# always maximizes a multinomial log-linear (logit) likelihood. Because choice
# is a two-level factor over the long-format rows, the calls below actually
# fit the same binary logit four times, as the identical output confirms.
clmodel <- multinom(choice ~ wait + travel + vcost, TravelMode, family = multinomial(link = "cloglog"))
## # weights: 5 (4 variable)
## initial value 582.243632
## final value 423.777668
## converged
# View the model summary
summary(clmodel)
## Call:
## multinom(formula = choice ~ wait + travel + vcost, data = TravelMode,
## family = multinomial(link = "cloglog"))
##
## Coefficients:
## Values Std. Err.
## (Intercept) 0.270041373 0.2283283572
## wait -0.041124020 0.0047380067
## travel -0.001801299 0.0003201124
## vcost 0.016061357 0.0032927743
##
## Residual Deviance: 847.5553
## AIC: 855.5553
# The same applies to the "cauchit" attempt below: the family argument is ignored
cumodel <- multinom(choice ~ wait + travel + vcost, TravelMode, family = multinomial(link = "cauchit"))
## # weights: 5 (4 variable)
## initial value 582.243632
## final value 423.777668
## converged
# View the model summary
summary(cumodel)
## Call:
## multinom(formula = choice ~ wait + travel + vcost, data = TravelMode,
## family = multinomial(link = "cauchit"))
##
## Coefficients:
## Values Std. Err.
## (Intercept) 0.270041373 0.2283283572
## wait -0.041124020 0.0047380067
## travel -0.001801299 0.0003201124
## vcost 0.016061357 0.0032927743
##
## Residual Deviance: 847.5553
## AIC: 855.5553
# The "logit" attempt (again, the family argument is ignored)
mnlmodel <- multinom(choice ~ wait + travel + vcost, TravelMode, family = multinomial(link = "logit"))
## # weights: 5 (4 variable)
## initial value 582.243632
## final value 423.777668
## converged
# View the model summary
summary(mnlmodel)
## Call:
## multinom(formula = choice ~ wait + travel + vcost, data = TravelMode,
## family = multinomial(link = "logit"))
##
## Coefficients:
## Values Std. Err.
## (Intercept) 0.270041373 0.2283283572
## wait -0.041124020 0.0047380067
## travel -0.001801299 0.0003201124
## vcost 0.016061357 0.0032927743
##
## Residual Deviance: 847.5553
## AIC: 855.5553
# The "probit" attempt (again, the family argument is ignored)
mnpmodel <- multinom(choice ~ wait + travel + vcost, TravelMode, family = multinomial(link = "probit"))
## # weights: 5 (4 variable)
## initial value 582.243632
## final value 423.777668
## converged
# View the model summary
summary(mnpmodel)
## Call:
## multinom(formula = choice ~ wait + travel + vcost, data = TravelMode,
## family = multinomial(link = "probit"))
##
## Coefficients:
## Values Std. Err.
## (Intercept) 0.270041373 0.2283283572
## wait -0.041124020 0.0047380067
## travel -0.001801299 0.0003201124
## vcost 0.016061357 0.0032927743
##
## Residual Deviance: 847.5553
## AIC: 855.5553
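multinom() can, however, fit a proper multinomial logit when the outcome is the chosen mode and the covariates are individual-specific; a minimal sketch on these data, with an illustrative specification (output not shown):
# Keep one row per individual: the alternative actually chosen
chosen <- subset(TravelMode, choice == "yes")
# Multinomial logit for mode choice with individual-specific covariates
proper_mnl <- multinom(mode ~ income + size, data = chosen)
summary(proper_mnl)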
# Model Comparison
aic<-AIC(mlmodel,clmodel,mpmodel,cumodel,mnlmodel,mnpmodel)
aic
## df AIC
## mlmodel 6 397.7770
## clmodel 4 855.5553
## mpmodel 12 409.1241
## cumodel 4 855.5553
## mnlmodel 4 855.5553
## mnpmodel 4 855.5553
The mlogit-based multinomial logit attains the lowest AIC. The four multinom fits share a single AIC because, as noted above, the family argument is ignored; their values are also computed from the long-format binary rows and are therefore not comparable with the mlogit results.
bic<-BIC(mlmodel,clmodel,mpmodel,cumodel)
bic
## df BIC
## mlmodel 6 NA
## clmodel 4 874.4889
## mpmodel 12 NA
## cumodel 4 874.4889
The BIC values for the mlogit fits are reported as NA, most likely because mlogit objects do not supply the sample size (nobs) that the BIC computation requires.