Amoud University
To determine the relationship between variables, regression models are frequently used. Regression models come in many forms, depending on the nature and scale of the dependent variable, and researchers often find it challenging to identify the appropriate type. This primer demonstrates the use of 35 distinct regression models, along with the computational challenges and underlying assumptions of each. The application of each regression model is discussed together with real-world example data; the model is then fitted using R software and the results are interpreted. Models were employed in this study to improve the estimation of the relationship between various types of variables, and were introduced according to the type and scale of measurement of the dependent variable. Regression models are increasingly in use today, and newer statistical techniques such as data mining rely on regression as a core mechanism for prediction. By introducing commonly used regression models, providing software instructions, and interpreting the results, this course aims to help researchers better understand the computational implementation of various regression models.
Regression models are statistical techniques that are used to model the relationship between a dependent variable and one or more independent variables. In this introduction, we will discuss several regression models that are commonly used for continuous variable responses.
Multiple linear regression: This is a basic regression model that models the relationship between a dependent variable and multiple independent variables. It assumes that the relationship between the dependent variable and each independent variable is linear and additive.
Polynomial regression: This model extends multiple linear regression by allowing the relationship between the dependent variable and independent variable to be nonlinear. It uses higher-order polynomial terms to capture the nonlinear relationship.
Ridge regression: This is a regularized regression model that is used when there is multicollinearity among the independent variables. It adds a penalty term to the sum of squared errors to shrink the regression coefficients towards zero.
LASSO regression: This is another regularized regression model that is used for feature selection. It adds a penalty term to the absolute values of the regression coefficients to force some of them to be exactly zero.
Robust regression: This model is used when the data has outliers or non-normal errors. It uses robust estimation techniques to reduce the influence of outliers on the regression estimates.
Quantile regression: This model is used when the distribution of the dependent variable is not normal. It estimates the conditional quantiles of the dependent variable instead of the conditional mean.
Elastic net regression: This is a combination of ridge and LASSO regression. It adds both L1 and L2 penalties to the regression coefficients to achieve both feature selection and shrinkage (the shared penalized objective is written out after this list).
Support vector regression: This is a nonlinear regression model that adapts support vector machines to regression by finding a function that fits the data within a specified margin of tolerance, ignoring errors smaller than that margin. It is commonly used for data with nonlinear relationships.
Principal component regression: This model is used when there are many independent variables that are highly correlated. It uses principal components analysis to reduce the dimensionality of the data and then performs multiple linear regression on the reduced dataset.
Partial least squares regression: This model is similar to principal component regression but uses a different method to reduce the dimensionality of the data. It finds a set of latent variables that are linear combinations of the original independent variables and then performs multiple linear regression on the reduced dataset.
Each of these regression models has its own strengths and weaknesses and is appropriate for different types of data and research questions. It is important to carefully select the appropriate regression model for your specific analysis to ensure accurate and reliable results.
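To make the penalized models above concrete: in the parameterization used by the glmnet package (employed for the examples later in this primer), ridge regression (\(\alpha = 0\)), LASSO (\(\alpha = 1\)), and elastic net (\(0 < \alpha < 1\)) all minimize the same objective,
\[\min_{\beta_0,\,\boldsymbol\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^\top\boldsymbol\beta\right)^2 + \lambda\left(\alpha\|\boldsymbol\beta\|_1 + \frac{1-\alpha}{2}\|\boldsymbol\beta\|_2^2\right)\]
where \(\lambda \ge 0\) sets the overall strength of the penalty.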
The first model introduced above is multiple linear regression. Multiple linear regression is a commonly used statistical technique for modeling the relationship between a dependent variable and multiple independent variables. It is a linear model that assumes a linear and additive relationship between the dependent variable and each independent variable.
The multiple linear regression model can be expressed as:
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]
where \(y\) is the dependent variable, \(x_1, x_2, \ldots, x_p\) are the independent variables, \(\beta_0, \beta_1, \ldots, \beta_p\) are the regression coefficients, and \(\epsilon\) is the error term.
The regression coefficients \(\beta_1, \ldots, \beta_p\) represent the change in the dependent variable \(y\) associated with a one-unit change in the corresponding independent variable, holding the other independent variables constant; \(\beta_0\) is the expected value of \(y\) when all independent variables are zero. The error term \(\epsilon\) represents the unobserved factors that affect the dependent variable and are not explained by the independent variables.
The goal of multiple linear regression is to estimate the regression coefficients that best fit the data. This is typically done using the method of least squares, which minimizes the sum of squared errors between the observed values of the dependent variable and the predicted values from the regression model.
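In matrix form, with design matrix \(\mathbf{X}\) (including a column of ones for the intercept) and response vector \(\mathbf{y}\), the least-squares estimates have the closed-form solution
\[\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
which exists whenever \(\mathbf{X}^\top\mathbf{X}\) is invertible.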
Multiple linear regression is a versatile technique that can be used for a wide range of research questions. It is commonly used in social sciences, economics, engineering, and many other fields to model the relationships between variables. However, it does have some assumptions that must be met, such as linearity, independence, and normality of the errors. If these assumptions are violated, alternative regression models, such as the ones discussed above, may be more appropriate.
Before using MLR, it is important to check whether the assumptions of the model are satisfied. Here are the assumptions of MLR:
Linearity: The relationship between the dependent variable and the independent variables should be linear. This means that the relationship between the dependent variable and each independent variable should be a straight line in a scatter plot.
Independence: The observations should be independent of each other. This means that the value of the dependent variable for one observation should not be related to the value of the dependent variable for another observation.
Homoscedasticity: The variance of the errors should be constant across all levels of the independent variables. This means that the spread of the residuals should be the same regardless of the value of the independent variables.
Normality: The errors should be normally distributed. This means that the distribution of the residuals should be a bell-shaped curve with a mean of zero.
No multicollinearity: The independent variables should not be highly correlated with each other. Multicollinearity can affect the stability and reliability of the estimates of the regression coefficients.
No influential outliers: The presence of influential outliers can distort the regression line and affect the estimates of the regression coefficients. Therefore, it is important to check for influential outliers using diagnostic plots and outlier detection techniques.
It is important to note that violating these assumptions can lead to biased or inefficient estimates of the regression coefficients and affect the statistical inference and predictions. Therefore, it is recommended to check these assumptions before interpreting the results of an MLR model.
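A minimal sketch of how these checks might be run in R, using the built-in mtcars data (the model and variables are illustrative, and the car package is assumed to be installed for vif()):
# Fit an illustrative model
model <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
# Diagnostic plots
plot(model, which = 1) # residuals vs fitted: linearity and homoscedasticity
plot(model, which = 2) # normal Q-Q plot: normality of the errors
plot(model, which = 4) # Cook's distance: influential observations
# Formal checks
library(car)
vif(model) # variance inflation factors; values above about 5-10 suggest multicollinearity
shapiro.test(residuals(model)) # test of residual normality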
Here are some alternative regression models that can be used when the assumptions of multiple linear regression are violated:
Nonlinear Regression: When the relationship between the independent variables and the dependent variable is not linear, nonlinear regression models can be used instead. Examples of nonlinear regression models include polynomial regression, exponential regression, and logarithmic regression.
Robust Regression: When there are influential outliers in the dataset, robust regression models can be used instead of multiple linear regression. Robust regression models are less sensitive to outliers and can provide more accurate estimates of the regression coefficients.
Ridge Regression: When there is multicollinearity among the independent variables, ridge regression can be used instead of multiple linear regression. Ridge regression adds a penalty term to the regression model to shrink the regression coefficients towards zero, reducing the impact of multicollinearity.
Quantile Regression: When the assumption of homoscedasticity is violated, quantile regression can be used instead of multiple linear regression. Quantile regression estimates the conditional quantiles of the dependent variable, which can provide more robust estimates of the regression coefficients.
Principal Component Regression: When there are too many independent variables or when the independent variables are highly correlated, principal component regression can be used instead of multiple linear regression. Principal component regression reduces the dimensionality of the independent variables by transforming them into a smaller set of uncorrelated variables called principal components.
Generalized Linear Models: When the dependent variable is not normally distributed, generalized linear models can be used instead of multiple linear regression. Generalized linear models allow for the dependent variable to follow a distribution other than the normal distribution, such as the binomial distribution, Poisson distribution, or gamma distribution.
It is important to choose the appropriate regression model based on the specific characteristics of the dataset and the research question at hand. R examples for these models are provided in the sections that follow.
The strengths and limitations of several of the models introduced above can be summarized as follows.
Strengths: Simple to understand and interpret, widely applicable to a range of data sets and problems, can identify relationships between multiple independent variables and a dependent variable.
Limitations: Assumes a linear relationship between the variables, can be sensitive to outliers and multicollinearity, does not perform well in situations with high-dimensional data.
Strengths: Helps to prevent overfitting in high-dimensional data sets, can be used for variable selection and regularization, can identify the most important predictors while controlling for multicollinearity.
Limitations: Assumes a linear relationship between the variables, may not perform well in situations with a large number of independent variables that are not strongly related to the dependent variable.
Strengths: Can model the relationship between variables across different quantiles, useful in situations where the relationship between variables is not linear, can identify the impact of different predictors on different parts of the distribution.
Limitations: Requires a large sample size, can be computationally intensive, may not perform well in situations with many independent variables.
Strengths: Can be used for variable selection and regularization, can identify the most important predictors while controlling for multicollinearity, performs well in high-dimensional data sets.
Limitations: Assumes a linear relationship between the variables, can be sensitive to outliers, may not perform well in situations with a large number of independent variables that are not strongly related to the dependent variable.
Strengths: Can model nonlinear relationships between variables, useful in situations where the relationship between variables is not linear.
Limitations: Can be sensitive to outliers, may overfit the data if the degree of the polynomial is too high.
Strengths: Can reduce the dimensionality of the data set, can be used for variable selection and regularization, performs well in situations with many independent variables.
Limitations: Assumes a linear relationship between the variables, can be sensitive to outliers, may not perform well in situations where the most important predictors are not strongly related to the principal components.
Strengths: Can reduce the dimensionality of the data set, can be used for variable selection and regularization, performs well in situations with many independent variables.
Limitations: Assumes a linear relationship between the variables, can be sensitive to outliers, may not perform well in situations where the most important predictors are not strongly related to the latent variables.
Here is a real-life example of how the choice among different regression techniques can play out in a research problem:
Question: You are conducting a study to predict sales of a product based on advertising expenditure and other factors such as price, product features, and competitor advertising. Which regression model(s) would be most appropriate for this study and why?
Answer: The most appropriate regression model for this study would be multiple linear regression. This is because we have a continuous dependent variable (sales) and multiple independent variables (advertising expenditure, price, product features, and competitor advertising) that are expected to have a linear and additive relationship with the dependent variable. Multiple linear regression would allow us to estimate the coefficients of each independent variable and determine their impact on the dependent variable.
Multiple linear regression is a good starting point since it is a basic model that assumes a linear and additive relationship between the dependent variable and independent variables. However, if its assumptions are violated, the alternatives described earlier — polynomial, ridge, LASSO, robust, quantile, elastic net, support vector, principal component, or partial least squares regression — may be more appropriate, depending on which assumption fails.
The appropriate regression model for predicting sales of a product would depend on the specific characteristics of the data, the research question, and the goals of the study.
Examples of multiple linear, polynomial, ridge, LASSO, robust, quantile, elastic net, support vector, principal component, and partial least squares regression in R
Here are some examples of how to use various regression techniques in R:
#### Multiple Linear Regression:
# Load the dataset
data(mtcars)
# Fit a multiple linear regression model
model <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## hp -0.02054 0.01215 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
#### Polynomial Regression:
# Load the dataset
data(cars)
# Fit a polynomial regression model
model <- lm(dist ~ poly(speed, 3), data = cars)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = dist ~ poly(speed, 3), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.670 -9.601 -2.231 7.075 44.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.98 2.15 19.988 < 2e-16 ***
## poly(speed, 3)1 145.55 15.21 9.573 1.6e-12 ***
## poly(speed, 3)2 23.00 15.21 1.512 0.137
## poly(speed, 3)3 13.80 15.21 0.907 0.369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 46 degrees of freedom
## Multiple R-squared: 0.6732, Adjusted R-squared: 0.6519
## F-statistic: 31.58 on 3 and 46 DF, p-value: 3.074e-11
#### Ridge Regression:
# Load the dataset
data(mtcars)
# Fit a ridge regression model
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
x <- model.matrix(mpg ~ ., data = mtcars)[,-1]
y <- mtcars$mpg
ridge_model <- glmnet(x, y, alpha = 0, lambda = 0.1)
# Print the coefficients of the model
coef(ridge_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 15.064606098
## cyl -0.135092887
## disp 0.005350088
## hp -0.016522992
## drat 0.884425934
## wt -2.820054931
## qsec 0.595278789
## vs 0.341380136
## am 2.386212242
## gear 0.694835162
## carb -0.484540729
#### LASSO Regression:
# Load the dataset
data(mtcars)
# Fit a LASSO regression model
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[,-1]
y <- mtcars$mpg
lasso_model <- glmnet(x, y, alpha = 1, lambda = 0.1)
# Print the coefficients of the model
coef(lasso_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 19.84148290
## cyl -0.20127426
## disp .
## hp -0.01305361
## drat 0.77624761
## wt -2.64612307
## qsec 0.46844940
## vs 0.13056381
## am 2.12332632
## gear 0.31394015
## carb -0.46502031
#### Robust Regression:
# Load the dataset
data(mtcars)
# Fit a robust regression model
library(MASS)
model <- rlm(mpg ~ wt + hp, data = mtcars)
# Print the summary of the model
summary(model)
##
## Call: rlm(formula = mpg ~ wt + hp, data = mtcars)
## Residuals:
## Min 1Q Median 3Q Max
## -3.6639 -1.3057 0.1727 1.3162 6.3392
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 36.5840 1.4380 25.4407
## wt -3.8801 0.5691 -6.8180
## hp -0.0293 0.0081 -3.6050
##
## Residual standard error: 2.006 on 29 degrees of freedom
#### Quantile Regression:
# Load the dataset
data(mtcars)
# Fit a quantile regression model
library(quantreg)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
model <- rq(mpg ~ wt + hp, data = mtcars, tau = 0.5)
# Print the summary of the model
summary(model)
##
## Call: rq(formula = mpg ~ wt + hp, tau = 0.5, data = mtcars)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 36.62601 31.41282 38.90949
## wt -3.60570 -5.91208 -2.81272
## hp -0.03559 -0.04981 -0.01885
#### Elastic Net Regression:
# Load the dataset
data(mtcars)
# Fit an elastic net regression model
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[,-1]
y <- mtcars$mpg
elastic_model <- glmnet(x, y, alpha = 0.5, lambda = 0.1)
# Print the coefficients of the model
coef(elastic_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 17.33886538
## cyl -0.11423932
## disp .
## hp -0.01317787
## drat 0.85451871
## wt -2.52613117
## qsec 0.50749464
## vs 0.20701945
## am 2.25592940
## gear 0.52586364
## carb -0.55423993
#### Support Vector Regression:
# Load the dataset
data(mtcars)
# Fit a support vector regression model
library(e1071)
model <- svm(mpg ~ wt + hp, data = mtcars, kernel = "radial")
# Print the summary of the model
summary(model)
##
## Call:
## svm(formula = mpg ~ wt + hp, data = mtcars, kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
## epsilon: 0.1
##
##
## Number of Support Vectors: 28
#### Principal Component Regression:
# Load the dataset
data(mtcars)
# Fit a principal component regression model
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
model <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")
# Print the summary of the model
summary(model)
## Data: X dimension: 32 10
## Y dimension: 32 1
## Fit method: svdpc
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 6.123 2.633 2.656 2.516 2.554 2.575 2.629
## adjCV 6.123 2.625 2.648 2.502 2.539 2.558 2.608
## 7 comps 8 comps 9 comps 10 comps
## CV 2.900 2.935 3.405 3.476
## adjCV 2.866 2.896 3.340 3.403
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 57.60 84.10 90.07 92.77 94.99 97.09 98.42 99.23
## mpg 82.53 82.63 85.40 85.41 85.47 85.56 85.58 85.85
## 9 comps 10 comps
## X 99.76 100.0
## mpg 86.09 86.9
#### Partial Least Squares Regression:
# Load the dataset
data(mtcars)
# Fit a partial least squares regression model
library(pls)
model <- plsr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")
# Print the summary of the model
summary(model)
## Data: X dimension: 32 10
## Y dimension: 32 1
## Fit method: kernelpls
## Number of components considered: 10
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 6.123 2.651 2.724 2.67 3.241 3.503 3.471
## adjCV 6.123 2.640 2.694 2.65 3.185 3.426 3.396
## 7 comps 8 comps 9 comps 10 comps
## CV 3.487 3.485 3.480 3.463
## adjCV 3.412 3.410 3.406 3.389
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 57.58 67.65 89.94 91.48 92.82 94.00 95.86 97.14
## mpg 83.18 85.39 85.75 86.38 86.72 86.85 86.89 86.90
## 9 comps 10 comps
## X 98.51 100.0
## mpg 86.90 86.9
Model adequacy checking is an important step to ensure that the regression model is appropriate for the given data. After fitting a regression model, adequacy checking involves assessing the goodness of fit, checking for violations of model assumptions, and identifying influential observations. This process is important for selecting the most appropriate regression model and ensuring that the results obtained from the model are reliable.
Goodness of fit refers to how well the model fits the data. One way to assess the goodness of fit is to use the R-squared value, which measures the proportion of variability in the dependent variable that is explained by the independent variables. A high R-squared value indicates a good fit, but it should be noted that a high R-squared value does not necessarily mean that the model is appropriate for the given data. Other measures of goodness of fit include adjusted R-squared, which takes into account the number of variables in the model, and residual standard error, which measures the average distance between the predicted values and the actual values.
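As a quick sketch, these goodness-of-fit measures can be extracted directly from a fitted lm object in R (the model is illustrative):
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared # R-squared
summary(fit)$adj.r.squared # adjusted R-squared
summary(fit)$sigma # residual standard error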
Regression models are based on certain assumptions, such as linearity, independence, normality, and constant variance of errors. Violations of these assumptions can lead to biased or unreliable results. Therefore, it is important to check for violations of model assumptions and take appropriate measures to address them. This can be done by examining the residual plots, which show the differences between the predicted values and the actual values. If the residual plot shows a pattern, such as a curve or a funnel shape, this indicates a violation of one or more model assumptions. In such cases, transformations or other adjustments to the model may be necessary.
Influential observations are observations that have a large effect on the regression model. These observations can have a disproportionate impact on the model’s coefficients and predictions, and can lead to biased or unreliable results. Therefore, it is important to identify influential observations and assess their impact on the model. This can be done by examining the Cook’s distance and leverage values, which measure the influence of each observation on the model. Observations with high Cook’s distance or leverage values should be further examined to determine their impact on the model.
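A sketch of how influential observations might be flagged in R; the cutoffs shown are common rules of thumb rather than fixed standards:
fit <- lm(mpg ~ wt + hp, data = mtcars)
cd <- cooks.distance(fit) # Cook's distance for each observation
lev <- hatvalues(fit) # leverage values
which(cd > 4 / nrow(mtcars)) # observations with unusually large Cook's distance
which(lev > 2 * mean(lev)) # high-leverage observations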
In addition to the measures of goodness of fit, there are several other metrics that can be used for model adequacy checking in regression analysis. These include:
RMSE (root mean squared error) measures the standard deviation of the residuals, which represents the average distance between the predicted values and the actual values. \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \] where \(y_i\) is the actual value of the dependent variable, \(\hat{y}_i\) is the predicted value of the dependent variable, and \(n\) is the number of observations.
MAE (mean absolute error) measures the average absolute difference between the predicted values and the actual values.
\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]
where \(y_i\) is the actual value of the dependent variable, \(\hat{y}_i\) is the predicted value of the dependent variable, and \(n\) is the number of observations.
MSPE (mean squared prediction error) measures the average squared difference between the predicted values and the actual values.
\[\text{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
where \(y_i\) is the actual value of the dependent variable, \(\hat{y}_i\) is the predicted value of the dependent variable, and \(n\) is the number of observations.
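As a minimal illustration, the three metrics can be computed directly in base R; here they are evaluated on in-sample fitted values purely for demonstration:
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
yhat <- fitted(fit)
sqrt(mean((y - yhat)^2)) # RMSE
mean(abs(y - yhat)) # MAE
mean((y - yhat)^2) # MSPE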
These metrics can provide additional information about the adequacy of the regression model and can help identify areas where the model may need to be improved.
Model adequacy checking is an important step in regression analysis. By assessing the goodness of fit, checking for violations of model assumptions, and identifying influential observations, we can ensure that the regression model is appropriate for the given data and that the results obtained from the model are reliable. In addition to model adequacy checking, it is also important to compare different regression models using information criteria, such as AIC, BIC, and HQIC, to select the most appropriate model for the given data. By combining model adequacy checking with model comparison and the use of appropriate metrics, we can improve the accuracy and reliability of the regression analysis.
Here are some examples of how to check the adequacy of various regression models in R:
# Load the necessary libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.4.2 v purrr 0.3.4
## v tibble 3.2.1 v dplyr 1.1.0
## v tidyr 1.2.0 v stringr 1.4.1
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x dplyr::select() masks MASS::select()
## x tidyr::unpack() masks Matrix::unpack()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:pls':
##
## R2
library(glmnet)
library(e1071)
library(pls)
# Set the seed for reproducibility
set.seed(123)
# Load the data
data(mtcars)
# Split the data into training and test sets
train_index <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train <- mtcars[train_index, ]
test <- mtcars[-train_index, ]
# Fit the multiple linear regression model
mlr <- lm(mpg ~ ., data = train)
# Fit the polynomial regression model
poly <- lm(mpg ~ poly(wt, 3), data = train)
# Fit the ridge regression model
ridge <- glmnet(as.matrix(train[, -1]), train$mpg, alpha = 0, lambda = 0.1)
# Fit the LASSO regression model
lasso <- glmnet(as.matrix(train[, -1]), train$mpg, alpha = 1, lambda = 0.1)
# Fit the robust regression model
library(MASS)
robust <- rlm(mpg ~ ., data = train)
# Fit the quantile regression model
library(quantreg)
quantile <- rq(mpg ~ ., data = train)
# Fit the elastic net regression model
enet <- glmnet(as.matrix(train[, -1]), train$mpg, alpha = 0.5, lambda = 0.1)
# Fit the support vector regression model
svm <- svm(mpg ~ ., data = train)
# Fit the principal component regression model
pca <- prcomp(train[, -1])
pcr <- lm(train$mpg ~ pca$x[, 1:5])
# Fit the partial least squares regression model
pls <- plsr(mpg ~ ., data = train, ncomp = 5)
# Make predictions on the test set
pred_mlr <- predict(mlr, newdata = test)
pred_poly <- predict(poly, newdata = test)
pred_ridge <- predict(ridge, newx = as.matrix(test[, -1]))
pred_lasso <- predict(lasso, newx = as.matrix(test[, -1]))
pred_robust <- predict(robust, newdata = test)
pred_quantile <- predict(quantile, newdata = test)
pred_enet <- predict(enet, newx = as.matrix(test[, -1]))
pred_svm <- predict(svm, newdata = test)
pred_pcr <- predict(pcr, newdata = test)
## Warning: 'newdata' had 4 rows but variables found have 28 rows
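# Note: the warning above flags a real problem. Because 'pcr' was fit on the
# matrix pca$x rather than on named data-frame columns, predict() ignores
# 'newdata' and returns the 28 training fitted values, so the PCR metrics
# computed below are not trustworthy. A corrected sketch (our addition, not
# part of the original analysis) projects the test predictors onto the
# training loadings first:
test_scores <- predict(pca, newdata = test[, -1]) # rotate test data into PC space
pcr_df <- data.frame(mpg = train$mpg, pca$x[, 1:5]) # training scores named PC1..PC5
pcr_fit <- lm(mpg ~ ., data = pcr_df) # refit on named columns
pred_pcr_fixed <- predict(pcr_fit, newdata = as.data.frame(test_scores[, 1:5]))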
pred_pls <- predict(pls, newdata = test)
# Calculate the RMSE, MAE, and MSPE for each model
rmse <- c(RMSE(pred_mlr, test$mpg), RMSE(pred_poly, test$mpg), RMSE(pred_ridge, test$mpg),
RMSE(pred_lasso, test$mpg), RMSE(pred_robust, test$mpg), RMSE(pred_quantile, test$mpg),
RMSE(pred_enet, test$mpg), RMSE(pred_svm, test$mpg), RMSE(pred_pcr, test$mpg),
RMSE(pred_pls, test$mpg))
mae <- c(MAE(pred_mlr, test$mpg), MAE(pred_poly, test$mpg), MAE(pred_ridge, test$mpg),
MAE(pred_lasso, test$mpg), MAE(pred_robust, test$mpg), MAE(pred_quantile, test$mpg),
MAE(pred_enet, test$mpg), MAE(pred_svm, test$mpg), MAE(pred_pcr, test$mpg),
MAE(pred_pls, test$mpg))
mspe <- c(mean((pred_mlr - test$mpg)^2), mean((pred_poly - test$mpg)^2),
mean((pred_ridge - test$mpg)^2), mean((pred_lasso - test$mpg)^2),
mean((pred_robust - test$mpg)^2), mean((pred_quantile - test$mpg)^2),
mean((pred_enet - test$mpg)^2), mean((pred_svm - test$mpg)^2),
mean((pred_pcr - test$mpg)^2), mean((pred_pls - test$mpg)^2))
# Print the results
models <- c("Multiple Linear", "Polynomial", "Ridge", "LASSO", "Robust", "Quantile", "Elastic Net", "Support Vector", "Principal Component", "Partial Least Squares")
results <- data.frame(Model = models, RMSE = rmse, MAE = mae, MSPE = mspe)
print(results)
## Model RMSE MAE MSPE
## 1 Multiple Linear 4.808981 3.384844 23.12630
## 2 Polynomial 3.662522 3.193003 13.41407
## 3 Ridge 4.373602 3.094265 19.12840
## 4 LASSO 4.139850 3.042097 17.13835
## 5 Robust 4.689033 3.455227 21.98703
## 6 Quantile 4.579320 3.467588 20.97017
## 7 Elastic Net 4.114031 2.984943 16.92525
## 8 Support Vector 3.699094 2.582040 13.68329
## 9 Principal Component 10.092617 7.995218 101.86092
## 10 Partial Least Squares 3.917685 2.877246 15.34826
In this example, we first load the necessary libraries and the mtcars dataset. We split the data into training and test sets and fit ten different regression models: multiple linear regression, polynomial regression, ridge regression, LASSO regression, robust regression, quantile regression, elastic net regression, support vector regression, principal component regression, and partial least squares regression. We then use these models to make predictions on the test data and calculate the RMSE and MAE using the RMSE() and MAE() functions from the caret package, and the MSPE using base R's mean(). Finally, we print the results in a table. Note that the principal component regression row reflects the prediction problem flagged above, which is why its errors are far larger than the others.
Note that in this example, we assume that the mtcars dataset is already loaded in R. Also, we assume that the necessary libraries are installed and loaded.
In summary, using RMSE, MAE, and MSPE to compare multiple linear, polynomial, ridge, LASSO, robust, quantile, elastic net, support vector, principal component, and partial least squares regressions can provide a useful way to evaluate the accuracy and reliability of the models in predicting the response variable. However, it is important to consider the specific context and goals of the analysis when interpreting the results and selecting the most appropriate model.
#### Introduction
When fitting a regression model, it is often necessary to compare different models and choose the most appropriate one for the given data. One way to compare regression models is to use information criteria, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Hannan-Quinn Information Criterion (HQIC). These criteria balance the trade-off between the goodness of fit and the complexity of the model, penalizing models with more parameters and favoring models that fit the data well with fewer parameters.
The Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. It is based on the idea of minimizing the Kullback-Leibler divergence between the true underlying distribution and the model's estimated distribution. The formula for AIC is:
\[\text{AIC} = -2\log(L) + 2p\]
where \(L\) is the likelihood function of the model and \(p\) is the number of parameters in the model. The AIC value of a model is the sum of twice the negative log-likelihood and twice the number of parameters. Therefore, a lower AIC value indicates a better trade-off between model fit and complexity.
The Bayesian Information Criterion (BIC) is a similar criterion to AIC but with a stronger penalty for model complexity. The formula for BIC is:
\[\text{BIC} = -2\log(L) + p\log(n)\]
where \(n\) is the sample size. Compared to AIC, BIC has a larger penalty for models with more parameters, which can prevent overfitting. Therefore, a lower BIC value indicates a better balance between model fit and complexity.
The Hannan-Quinn Information Criterion (HQIC) is another information criterion that takes into account the sample size and the number of parameters in the model. The formula for HQIC is:
\[\text{HQIC} = -2\log(L) + 2p\log(\log(n))\]
HQIC has a smaller penalty for model complexity compared to BIC but a larger penalty compared to AIC. Therefore, it provides a middle ground between AIC and BIC.
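R provides built-in AIC() and BIC() functions, but no built-in HQIC; a minimal sketch of computing it from the log-likelihood, following the formula above (the model is illustrative):
fit <- lm(mpg ~ wt + hp, data = mtcars)
ll <- logLik(fit)
p <- attr(ll, "df") # number of estimated parameters
n <- nobs(fit)
-2 * as.numeric(ll) + 2 * p * log(log(n)) # HQIC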
#### Comparing regression models using AIC, BIC, and HQIC
To compare different regression models using AIC, BIC, and HQIC, we can calculate the values of these criteria for each model and compare them. The model with the lowest value of the criterion is preferred. However, it is important to note that the choice of information criterion should not be the only factor in choosing the best model. Other factors such as the interpretability and practical usefulness of the models should also be considered.
In conclusion, AIC, BIC, and HQIC are useful tools for comparing regression models and selecting the most appropriate model for a given set of data. By balancing the trade-off between model fit and complexity, these criteria can help prevent overfitting and improve the generalizability of the model.
Comparing models is an important step in the regression analysis process. Here are some examples of how to compare the performance of different regression models in R:
# Multiple linear regression
lm.fit <- lm(mpg ~ wt + disp + hp, data = train)
lm.aic <- AIC(lm.fit)
lm.bic <- BIC(lm.fit)
# Polynomial regression (degree 2)
poly.fit <- lm(mpg ~ poly(wt, 2) + poly(disp, 2) + poly(hp, 2), data = train)
poly.aic <- AIC(poly.fit)
poly.bic <- BIC(poly.fit)
# Ridge regression
library(lmridge)
ridge.fit <- lmridge(mpg ~ wt + disp + hp, data = mtcars, k = nrow(mtcars))
ridge.aic <- infocr(ridge.fit)[,1]
ridge.bic <- infocr(ridge.fit)[,2]
# LASSO regression (fixing lambda = 0.1, as in the earlier examples, so that the
# criteria below are single numbers rather than one value per lambda along the
# path; these deviance-based values are approximations, strictly comparable
# only among the glmnet fits)
lasso.fit <- glmnet(as.matrix(train[, -1]), train[, 1], alpha = 1, lambda = 0.1)
tLL <- lasso.fit$nulldev - deviance(lasso.fit) # improvement in deviance over the null model
k <- lasso.fit$df # number of nonzero coefficients
n <- lasso.fit$nobs
lasso.aic <- -2*tLL + 2*k
lasso.bic <- log(n)*k - tLL
# Robust regression
library(MASS)
huber.fit <- rlm(mpg ~ wt + disp + hp, data = train, psi = psi.huber)
huber.aic <- AIC(huber.fit)
huber.bic <- BIC(huber.fit)
# Quantile regression
library(quantreg)
quantile.fit <- rq(mpg ~ wt + disp + hp, data = train, tau = 0.5)
quantile.aic <- AIC(quantile.fit)
quantile.bic <- BIC(quantile.fit)
# Elastic net regression (again with lambda fixed at 0.1)
enet.fit <- glmnet(as.matrix(train[, -1]), train[, 1], alpha = 0.5, lambda = 0.1)
tLL <- enet.fit$nulldev - deviance(enet.fit)
k <- enet.fit$df
n <- enet.fit$nobs
enet.aic <- -2*tLL + 2*k
enet.bic <- log(n)*k - tLL
Finally, we can compare the AIC and BIC values for each model using a table:
results <- data.frame(Model = c("Multiple Linear", "Polynomial", "Ridge", "LASSO", "Robust", "Quantile", "Elastic Net"),
                      AIC = c(lm.aic, poly.aic, ridge.aic, lasso.aic, huber.aic, quantile.aic, enet.aic),
                      BIC = c(lm.bic, poly.bic, ridge.bic, lasso.bic, huber.bic, quantile.bic, enet.bic))
results
(With a single lambda value for each penalized model, results prints as a seven-row table with one AIC and one BIC value per model.)
Based on the AIC and BIC values, the regression model with the lowest values of the information criteria is preferred. However, we should also consider other factors such as the interpretability and practical usefulness of the models before making a final decision.
Module VI of our course covers regression models for discrete variable responses, with a focus on count data. In this module, we explore ten different count regression models that can be used to analyze data with discrete, non-negative count outcomes, including Poisson, negative binomial, geometric, quasi-Poisson, zero-truncated, zero-inflated, and hurdle regression models.
Each of these models has its own strengths and weaknesses, and is appropriate for different types of count data. For example, Poisson regression is often used when the mean and variance of the count data are approximately equal, while negative binomial regression is useful when there is overdispersion in the data (i.e., the variance is greater than the mean). The zero-inflated and hurdle models are designed to handle situations where there are excess zeros in the data, which can be caused by a variety of factors such as measurement error or the presence of a sub-population with a different underlying distribution.
Overall, this module provides a comprehensive overview of the different count regression models that are available for analyzing count data, and will be useful for anyone working with this type of data in their research or professional work.
Poisson regression is a type of regression analysis that is used to model count data. It is based on the Poisson distribution, which is a probability distribution that describes the number of times an event occurs in a fixed interval of time or space. Poisson regression is commonly used in fields such as epidemiology, biology, and economics, where the outcome variable is a count of events.
The Poisson regression model assumes that the dependent variable \(Y\) follows a Poisson distribution with a mean parameter \(\lambda\). The model assumes that the log of the mean parameter is a linear function of the independent variables \(X\), and is given by the following equation:
\[ \ln(\lambda) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \]
where \(\beta_0\) is the intercept, \(\beta_1, \beta_2, \ldots, \beta_k\) are the coefficients of the independent variables \(X_1, X_2, \ldots, X_k\), respectively.
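A minimal sketch of fitting this model in R with glm(), using the built-in warpbreaks data, where the response (breaks) is a count:
# Poisson regression for the number of warp breaks per loom
model <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
summary(model)
exp(coef(model)) # exponentiated coefficients: multiplicative effects on the mean count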
The assumptions of the Poisson regression model are:
Independence: The observations should be independent of each other. This means that the occurrence of an event for one unit should not affect the occurrence of an event for another unit.
Count data: The dependent variable should be count data, which means that it should be a discrete variable that takes non-negative integer values.
Linearity: The relationship between the logarithm of the mean parameter and the independent variables should be linear. This means that the effect of the independent variables on the dependent variable should be proportional to the change in the mean of the dependent variable.
Homogeneity of variance: The variance of the dependent variable should be equal to its mean, which is the defining property of the Poisson distribution. This assumption is referred to as equidispersion. If the variance is greater than the mean (overdispersion) or less than the mean (underdispersion), then the Poisson regression model may not be appropriate (a quick check is sketched after this list).
No multicollinearity: The independent variables should not be highly correlated with each other. If there is multicollinearity among the independent variables, then the coefficients of the independent variables may be unstable and difficult to interpret.
Lack of influential outliers: Outliers or extreme values in the independent variables or the dependent variable can affect the results of the Poisson regression model. Therefore, it is important to check for influential outliers and consider their impact on the results.
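A quick, informal check of the equidispersion assumption compares the residual deviance to its degrees of freedom; a ratio well above 1 suggests overdispersion. A sketch, continuing the warpbreaks example above:
model <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
deviance(model) / df.residual(model) # close to 1 under equidispersion
# A formal score test is available via dispersiontest() in the AER package.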
Poisson regression can be applied to a wide range of real-life data, such as the number of deaths due to a particular disease, the number of accidents on a particular road, or the number of customers visiting a store on a particular day. For example, a Poisson regression model can be used to analyze the factors that affect the number of accidents on a particular road, such as traffic volume, weather conditions, and road design.
Although Poisson regression is a powerful statistical model for analyzing count data, it has some limitations that should be considered. Here are some of the limitations of the Poisson regression model:
Overdispersion: The Poisson regression model assumes that the variance of the dependent variable is equal to its mean, which is called equidispersion. However, in some cases, the variance may be larger than the mean, which is called overdispersion. Overdispersion can lead to biased parameter estimates and incorrect inferences.
Zero Inflation: The Poisson regression model assumes that the probability of a count being zero is low. However, in some cases, there may be an excess number of zeros in the data, which is called zero inflation. Poisson regression may not be appropriate for such data, and alternative models such as zero-inflated Poisson regression or hurdle models may be more appropriate.
Nonlinear relationships: The Poisson regression model assumes that the relationship between the dependent variable and the independent variables is linear. If the relationship is nonlinear, then the Poisson regression model may not capture the true relationship between the variables.
Outliers: The Poisson regression model is sensitive to outliers, which can affect the estimates of the regression coefficients and the goodness-of-fit of the model.
Independence: The Poisson regression model assumes that the observations are independent of each other. However, in some cases, the observations may be correlated, which can violate the independence assumption of the model.
Model Selection: Poisson regression models can be complex, and there may be several models with different independent variables that can fit the data equally well. Choosing the best model can be challenging and requires careful consideration of the model assumptions, goodness-of-fit, and predictive accuracy.
Large counts: The Poisson regression model may not be appropriate for count data with very large counts, especially if the sample size is small. As the counts get larger, the normal approximation used in the Poisson regression model may become less accurate, and alternative models such as negative binomial regression may be more appropriate.
Continuous independent variables: Poisson regression accommodates continuous as well as categorical independent variables, but it assumes that their effect on the log of the mean count is linear. If a continuous independent variable has a nonlinear effect, the model may not capture the true relationship between the variables. In such cases, a transformation of the independent variables or a more flexible model may be necessary.
Missing data: The Poisson regression model may not be appropriate for count data with missing values, as the missing values can affect the estimates of the regression coefficients and the goodness-of-fit of the model. In such cases, methods such as multiple imputation or maximum likelihood estimation may be used to handle missing data.
Model assumptions: The Poisson regression model assumes that the data follows a Poisson distribution and that the relationship between the dependent variable and the independent variables is linear. Violations of these assumptions can lead to biased parameter estimates and incorrect inferences. Therefore, it is important to assess the goodness-of-fit of the model and to consider alternative models if the assumptions are not met.
Zero truncation: Zero truncation occurs when the data contains only positive counts, and there are no zeros. This can happen when the data is collected from a population that is guaranteed to produce at least one count. The Poisson regression model assumes that the probability of a count being zero is positive, and therefore may not be appropriate for zero-truncated data. In such cases, zero-truncated Poisson regression or zero-inflated Poisson regression may be more appropriate.
Zero asymmetry: Zero asymmetry occurs when there are more zeros in the data than would be expected under a Poisson distribution. This can happen when the data contains a mixture of two populations, one with a high probability of producing zero counts and another with a Poisson distribution for the positive counts. The Poisson regression model assumes that the probability of a count being zero is low, and therefore may not be appropriate for zero-asymmetric data. In such cases, zero-inflated Poisson regression or hurdle models may be more appropriate.
In summary, while the Poisson regression model is a powerful tool for analyzing count data, it has several limitations that should be considered. Choosing the appropriate model requires careful consideration of the characteristics of the data and the assumptions of the model.
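As a practical first step, the dispersion statistic and a comparison of observed versus expected zeros can flag several of these problems. The following is a minimal sketch, assuming a hypothetical data frame d with a count outcome y and a predictor x (placeholders, not data used elsewhere in this course):
# Check equidispersion and excess zeros for a Poisson fit
fit <- glm(y ~ x, family = poisson(), data = d)
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion                                    # values well above 1 suggest overdispersion
observed_zeros <- sum(d$y == 0)               # zeros actually observed
expected_zeros <- sum(dpois(0, fitted(fit)))  # zeros implied by the Poisson fit
c(observed = observed_zeros, expected = round(expected_zeros))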
Zero truncation: If the count data are zero-truncated (only positive counts are possible), zero-truncated Poisson regression (or zero-truncated negative binomial regression) is an alternative to Poisson regression. This model assumes that the data follow a Poisson (or negative binomial) distribution with the zero class removed and the remaining probabilities rescaled.
Zero asymmetry: If the count data contain more zeros than would be expected under a Poisson distribution, then Poisson regression may not be appropriate. Geometric regression can be an alternative if the data follow a geometric distribution, which places more probability mass at zero than a Poisson distribution with the same mean.
Overdispersion: If the variance of the count data is greater than the mean, then the assumption of equidispersion in Poisson regression is violated. Negative binomial regression is often used as an alternative in such situations. However, if the count data follows a geometric distribution, then geometric regression may be a suitable alternative as it can handle overdispersion.
Underdispersion: If the variance of the count data is less than the mean, then the assumption of equidispersion in Poisson regression is violated. Quasi-Poisson regression is often used as an alternative in such situations, because its estimated dispersion parameter can be less than one. Note that the geometric distribution is overdispersed relative to the Poisson, so geometric regression is not a remedy for underdispersion.
Zero Inflation: Zero-inflated Poisson regression is an alternative to Poisson regression when there is zero inflation in the data, meaning that there are more zeros than expected under a Poisson distribution. This model is a mixture of two distributions, a Poisson distribution for the non-zero counts and a point mass at zero for the excess zeros.
Zero Inflation: Hurdle models are another alternative to Poisson regression when there is zero inflation in the data. This model is a two-part model, with a logistic regression model predicting the probability of a count being zero (the hurdle) and a truncated Poisson or negative binomial model for the positive counts.
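Before turning to each of these models in detail, the sketch below maps the situations above to commonly used R fitting calls (from base R, MASS, pscl, and VGAM); d, y, and x are hypothetical placeholders, and each line is an illustration rather than a complete analysis:
glm(y ~ x, family = poisson(), data = d)                           # baseline Poisson
MASS::glm.nb(y ~ x, data = d)                                      # overdispersion: negative binomial
glm(y ~ x, family = quasipoisson(), data = d)                      # over-/underdispersion: quasi-Poisson
glm(y ~ x, family = MASS::negative.binomial(theta = 1), data = d)  # geometric (NB with theta = 1)
pscl::zeroinfl(y ~ x | 1, data = d, dist = "poisson")              # excess zeros: ZIP
pscl::zeroinfl(y ~ x | 1, data = d, dist = "negbin")               # excess zeros + overdispersion: ZINB
pscl::hurdle(y ~ x | 1, data = d, dist = "poisson")                # excess zeros: hurdle Poisson
VGAM::vglm(y ~ x, family = VGAM::pospoisson(), data = d)           # zero truncation: ZTP
VGAM::vglm(y ~ x, family = VGAM::posnegbinomial(), data = d)       # zero truncation + overdispersion: ZTNB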
The Poisson regression model is used when the count data are equidispersed, that is, when the variance equals the mean. The Poisson regression model assumes that the count data follow a Poisson distribution and models the expected count as a function of one or more predictor variables.
Example: A researcher wants to model the number of traffic accidents per day on a busy road. The data show that the number of accidents per day is relatively stable over time and across different weather conditions.
The Negative Binomial regression model is used when the count data have overdispersion, meaning that the variance is greater than the mean. The Negative Binomial regression model assumes that the count data follow a Negative Binomial distribution and models the expected count as a function of one or more predictor variables.
Example: A researcher wants to model the number of books read per month by a book club. The data show that some members read many more books than others, resulting in overdispersion.
The Geometric regression model is used when the count data have their highest probability at zero, with the probability decreasing monotonically for larger counts. The Geometric regression model assumes that the count data follow a Geometric distribution and models the expected count as a function of one or more predictor variables.
Example: A researcher wants to model the number of attempts it takes for a student to pass a difficult exam. The data show that some students pass on their first attempt, but most require multiple attempts.
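Because the geometric distribution is a negative binomial with the size parameter fixed at 1, a geometric regression can be fitted as a negative binomial GLM with theta held at 1. The sketch below uses simulated data, since the worked example later in this section does not include a geometric fit:
# Geometric regression sketch on simulated data (NB with size = 1 is geometric)
library(MASS)
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rnbinom(200, size = 1, mu = exp(0.5 + 0.8 * d$x))  # geometric counts
geo_fit <- glm(y ~ x, family = negative.binomial(theta = 1), data = d)
summary(geo_fit)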
The Quasi-Poisson regression model is used when the count data are underdispersed (or overdispersed) and the Negative Binomial regression model cannot be used due to lack of convergence or other issues. The Quasi-Poisson regression model keeps the Poisson mean structure but estimates a dispersion parameter, allowing the variance to be a multiple of the mean (greater than the mean under overdispersion, smaller under underdispersion).
Example: A researcher wants to model the number of visits per day to a website. The data show a lot of variability in the number of visits per day, but the Poisson and Negative Binomial models do not provide a good fit.
The ZIP model assumes that the count data follow a Poisson distribution but includes an additional component to account for the excess zeros. The model consists of two parts: a binary component that models the occurrence of zeros and a Poisson component that models the non-zero counts. The binary component can be modeled using a logistic regression model, and the Poisson component can be modeled using a Poisson regression model.
Example: A researcher wants to model the number of medical appointments per month for patients with a chronic illness. The data show an excess of patients who did not schedule any appointments in a given month.
The ZTP model assumes that the count data follow a Poisson distribution but excludes the zeros from the model. The ZTP model is used when the zeros in the data are not excess zeros, but are instead the result of a truncated distribution. Because the zero class is removed, a proper ZTP fit rescales the Poisson likelihood by 1 − P(Y = 0); fitting an ordinary Poisson regression to the non-zero counts only, as is sometimes done for convenience, is an approximation.
Example: A researcher wants to model the number of fish caught per fishing trip. The data show that some fishing trips resulted in zero fish caught due to weather conditions or other factors, but these zeros are not excess zeros.
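For a likelihood-based zero-truncated fit, one option is VGAM's pospoisson family, which rescales the Poisson probabilities by 1 − P(Y = 0) rather than simply dropping the zeros (posnegbinomial is the negative binomial analogue). A minimal sketch, with d, y, and x as hypothetical placeholders:
# Zero-truncated Poisson sketch via VGAM (adjusts the likelihood for truncation)
library(VGAM)
ztp_fit <- vglm(y ~ x, family = pospoisson(), data = subset(d, y > 0))
summary(ztp_fit)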
The hurdle model is a two-part model that is similar to the ZIP model but uses a different distribution for the non-zero counts. The hurdle model consists of a binary component that models the occurrence of zeros and a truncated Poisson or negative binomial component that models the non-zero counts.
Example: A researcher wants to model the number of car accidents per day on a busy intersection. The data show an excess of days with no accidents and a right-skewed distribution of the non-zero counts.
The ZINB model is a generalization of the ZIP model that allows for overdispersion in the count data. The ZINB model consists of two parts: a binary component that models the occurrence of zeros and a negative binomial component that models the non-zero counts. The binary component can be modeled using a logistic regression model, and the negative binomial component can be modeled using a negative binomial regression model.
Example: A researcher wants to model the number of accidents per month in a city. The data show an excess of months with no accidents and overdispersion in the non-zero counts.
The ZTNB model is a generalization of the ZTP model that allows for overdispersion in the count data. The ZTNB model assumes that the count data follow a negative binomial distribution but excludes the zeros from the model. A proper ZTNB fit uses the truncated negative binomial likelihood on the non-zero counts; fitting an ordinary negative binomial regression to the non-zero counts only is an approximation.
Example: A researcher wants to model the number of books read per month by a book club. The data show that members who did not read any books in a given month did not attend the book club meeting.
The hurdle negative binomial model has the same two-part structure as the hurdle Poisson model, with a binary component that models the occurrence of zeros, but it uses a truncated negative binomial component for the non-zero counts, allowing overdispersion in the positive counts.
Example: A researcher wants to model the number of visits per month to a website. The data show an excess of visitors who did not visit the website in a given month and a right-skewed distribution of the non-zero counts.
Worked example: a researcher wants to study “Risk Factors of Under-Five Mortality in Togdheer: Application of Count Regression Models”, using evidence from the 2020 SLDHS data.
Under-five mortality refers to the death of children under the age of five years. Despite the significant progress that has been made in reducing under-five mortality globally, the burden of child mortality remains high in many developing countries, including Togdheer. Togdheer is one of the regions in Somaliland that has been severely affected by conflict, drought, and poverty, leading to poor health outcomes, including high under-five mortality rates.
Understanding the risk factors associated with under-five mortality in Togdheer is crucial for developing effective strategies to reduce child mortality. The 2020 Somaliland Demographic and Health Survey (SLDHS) provides a unique opportunity to investigate the risk factors associated with under-five mortality in Togdheer.
This study aims to investigate the risk factors of under-five mortality in Togdheer, using count regression models. The study will utilize the SLDHS 2020 data to identify the factors associated with under-five mortality in Togdheer and estimate the magnitude of their effects.
The findings of this study will contribute to the existing literature on child mortality and provide important insights for policymakers and stakeholders in Togdheer to develop and implement effective interventions to reduce under-five mortality. Additionally, the study will provide evidence-based recommendations for improving child health outcomes in Togdheer and other similar settings.
This study utilized a cross-sectional study design and analyzed data from the 2020 Somaliland Demographic and Health Survey (SLDHS). The SLDHS is a nationally representative survey that provides data on health indicators, including under-five mortality, maternal and child health, and other relevant demographic and socioeconomic variables.
The study population consisted of children under the age of five years who were born to women aged 15-49 years and were living in Togdheer region at the time of the survey.
The SLDHS used a stratified two-stage cluster sampling design to select a representative sample of households. In the first stage, clusters (enumeration areas) were selected using probability proportional to size sampling. In the second stage, households were selected using systematic sampling. Trained interviewers collected data using standardized questionnaires, including the Household Questionnaire, Women’s Questionnaire, and Under-Five Questionnaire.
The outcome variable for this study was under-five mortality, measured as the number of deaths among a mother’s children under the age of five years. The main independent variables were maternal and child demographic and socioeconomic characteristics, including maternal age, education, occupation, household wealth index, and access to health care services, among others.
Descriptive statistics were used to summarize the characteristics of the study population and the prevalence of under-five mortality. Count regression models, including Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, hurdle Poisson, and hurdle negative binomial regression models, were used to identify the risk factors associated with under-five mortality and estimate the magnitude of their effects.
The SLDHS was approved by the Somali Ministry of Health and implemented by the National Statistics Directorate. Informed consent was obtained from all participants, and all data were kept confidential and anonymous. This study was approved by the Ethics Review Committee of the University of Togdheer.
# Load library
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(MASS)
library(actuar)
## Warning: package 'actuar' was built under R version 4.1.3
##
## Attaching package: 'actuar'
## The following objects are masked from 'package:stats':
##
## sd, var
## The following object is masked from 'package:grDevices':
##
## cm
# Load data
library(haven)
## Warning: package 'haven' was built under R version 4.1.3
data <- read_dta("Togdheer2.dta")
data<-data[data$region==3,]
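# Convert the categorical predictors to factors so the models dummy-code them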
data$childsex<-factor(data$childsex)
data$childtwin<-factor(data$childtwin)
data$Residence<-factor(data$Residence)
data$breastfeeding<-factor(data$breastfeeding)
data$Education<-factor(data$Education)
data$toiletfacility<-factor(data$toiletfacility)
data$watersource<-factor(data$watersource)
data$wealthindex<-factor(data$wealthindex)
data$maritalstatus<-factor(data$maritalstatus)
# Fit Poisson model
poisson_model<-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data, family = poisson())
# Summarize results
summary(poisson_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = poisson(), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.94809 -0.99148 0.02731 0.47494 1.55367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 2.365e-01 -9.885 < 2e-16 ***
## childsex2 -9.819e-02 6.186e-02 -1.587 0.112465
## childtwin2 7.330e-03 1.915e-01 0.038 0.969471
## childreneverborn 1.030e-01 1.316e-02 7.822 5.19e-15 ***
## watersource2 2.566e-01 7.694e-02 3.335 0.000852 ***
## toiletfacility2 -3.069e-01 1.008e-01 -3.046 0.002322 **
## wealthindex2 6.405e-01 1.471e-01 4.354 1.34e-05 ***
## wealthindex3 4.012e-01 1.450e-01 2.767 0.005664 **
## wealthindex4 5.634e-01 1.377e-01 4.091 4.30e-05 ***
## wealthindex5 5.010e-01 1.405e-01 3.567 0.000362 ***
## maritalstatus2 -1.377e-01 1.133e-01 -1.215 0.224385
## maritalstatus3 -3.958e-01 1.487e-01 -2.662 0.007758 **
## birthorder -8.997e-06 1.175e-02 -0.001 0.999389
## breastfeeding2 6.093e-01 9.928e-02 6.137 8.43e-10 ***
## mothersageatfirstbirth 3.417e-02 7.752e-03 4.408 1.04e-05 ***
## Residence2 -1.018e-01 1.041e-01 -0.978 0.328024
## Residence3 -1.163e-02 9.838e-02 -0.118 0.905877
## Education2 1.766e-02 1.139e-01 0.155 0.876753
## Education3 -2.737e-01 2.346e-01 -1.166 0.243415
## Education4 3.973e-01 4.652e-01 0.854 0.392997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1054.56 on 1004 degrees of freedom
## Residual deviance: 778.08 on 985 degrees of freedom
## AIC: 2417.4
##
## Number of Fisher Scoring iterations: 5
# Fit Negative Binomial model
nb_model <- glm.nb(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Summarize results (the very large theta with an even larger standard error
# means the negative binomial fit collapses to the Poisson: no overdispersion)
summary(nb_model)
##
## Call:
## glm.nb(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, data = data, init.theta = 24126.54212, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9481 -0.9915 0.0273 0.4749 1.5536
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 2.365e-01 -9.884 < 2e-16 ***
## childsex2 -9.819e-02 6.187e-02 -1.587 0.112469
## childtwin2 7.332e-03 1.915e-01 0.038 0.969462
## childreneverborn 1.030e-01 1.316e-02 7.822 5.21e-15 ***
## watersource2 2.566e-01 7.694e-02 3.335 0.000853 ***
## toiletfacility2 -3.069e-01 1.008e-01 -3.046 0.002322 **
## wealthindex2 6.405e-01 1.471e-01 4.354 1.34e-05 ***
## wealthindex3 4.013e-01 1.450e-01 2.767 0.005663 **
## wealthindex4 5.635e-01 1.377e-01 4.091 4.30e-05 ***
## wealthindex5 5.010e-01 1.405e-01 3.567 0.000362 ***
## maritalstatus2 -1.377e-01 1.133e-01 -1.215 0.224399
## maritalstatus3 -3.958e-01 1.487e-01 -2.662 0.007760 **
## birthorder -8.966e-06 1.175e-02 -0.001 0.999391
## breastfeeding2 6.093e-01 9.929e-02 6.136 8.44e-10 ***
## mothersageatfirstbirth 3.417e-02 7.752e-03 4.408 1.04e-05 ***
## Residence2 -1.018e-01 1.041e-01 -0.978 0.328041
## Residence3 -1.164e-02 9.838e-02 -0.118 0.905821
## Education2 1.766e-02 1.139e-01 0.155 0.876721
## Education3 -2.737e-01 2.346e-01 -1.166 0.243415
## Education4 3.973e-01 4.652e-01 0.854 0.393011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(24126.54) family taken to be 1)
##
## Null deviance: 1054.52 on 1004 degrees of freedom
## Residual deviance: 778.06 on 985 degrees of freedom
## AIC: 2419.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 24127
## Std. Err.: 96258
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -2377.401
# Fit quasi model (as specified, variance = mu^2; note the standard
# quasi-Poisson family would be family = quasipoisson(), with variance
# proportional to the mean)
qp_model <- glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = data, family = quasi(link = "log", variance = "mu^2"))
# Summarize results
summary(qp_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = quasi(link = "log", variance = "mu^2"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5645 0.0000 0.0000 0.4933 2.4611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.2181763 0.2076058 -10.685 < 2e-16 ***
## childsex2 -0.1843738 0.0582980 -3.163 0.001611 **
## childtwin2 0.0821773 0.2083870 0.394 0.693409
## childreneverborn 0.0933266 0.0127406 7.325 4.96e-13 ***
## watersource2 0.3368912 0.0675941 4.984 7.35e-07 ***
## toiletfacility2 -0.3728549 0.0925331 -4.029 6.02e-05 ***
## wealthindex2 0.7159793 0.1214427 5.896 5.12e-09 ***
## wealthindex3 0.7812116 0.1196540 6.529 1.06e-10 ***
## wealthindex4 0.7747426 0.1083647 7.149 1.70e-12 ***
## wealthindex5 0.7702721 0.1111336 6.931 7.54e-12 ***
## maritalstatus2 -0.1689206 0.1072387 -1.575 0.115535
## maritalstatus3 -0.3904472 0.1271843 -3.070 0.002200 **
## birthorder 0.0005146 0.0121542 0.042 0.966236
## breastfeeding2 0.6008211 0.0754787 7.960 4.71e-15 ***
## mothersageatfirstbirth 0.0288872 0.0075836 3.809 0.000148 ***
## Residence2 -0.1453165 0.0967993 -1.501 0.133621
## Residence3 -0.1671277 0.0928746 -1.799 0.072246 .
## Education2 0.0782821 0.1144187 0.684 0.494027
## Education3 -0.4491009 0.1879573 -2.389 0.017064 *
## Education4 0.3690813 0.4270735 0.864 0.387683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasi family taken to be 0.8339886)
##
## Null deviance: 255.51 on 1004 degrees of freedom
## Residual deviance: 264.35 on 985 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 19
# Fit ZIP model
zip_model <-zeroinfl(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|1, data = data, dist = "poisson")
# Summarize results
summary(zip_model)
##
## Call:
## zeroinfl(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## 1, data = data, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.37754 -0.70104 0.02747 0.52146 1.91064
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 2.365e-01 -9.886 < 2e-16 ***
## childsex2 -9.820e-02 6.186e-02 -1.587 0.112426
## childtwin2 7.261e-03 1.915e-01 0.038 0.969758
## childreneverborn 1.030e-01 1.316e-02 7.823 5.17e-15 ***
## watersource2 2.566e-01 7.694e-02 3.335 0.000852 ***
## toiletfacility2 -3.069e-01 1.008e-01 -3.045 0.002325 **
## wealthindex2 6.406e-01 1.471e-01 4.354 1.34e-05 ***
## wealthindex3 4.013e-01 1.450e-01 2.767 0.005652 **
## wealthindex4 5.635e-01 1.377e-01 4.091 4.30e-05 ***
## wealthindex5 5.011e-01 1.405e-01 3.567 0.000361 ***
## maritalstatus2 -1.377e-01 1.133e-01 -1.215 0.224487
## maritalstatus3 -3.957e-01 1.486e-01 -2.662 0.007769 **
## birthorder -1.015e-05 1.175e-02 -0.001 0.999311
## breastfeeding2 6.093e-01 9.929e-02 6.137 8.42e-10 ***
## mothersageatfirstbirth 3.418e-02 7.751e-03 4.409 1.04e-05 ***
## Residence2 -1.018e-01 1.041e-01 -0.978 0.328076
## Residence3 -1.166e-02 9.838e-02 -0.119 0.905636
## Education2 1.771e-02 1.139e-01 0.156 0.876423
## Education3 -2.736e-01 2.346e-01 -1.166 0.243486
## Education4 3.980e-01 4.650e-01 0.856 0.392111
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.85 108.51 -0.137 0.891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 45
## Log-likelihood: -1189 on 21 Df
# Fit ZTP model (approximated here by a Poisson GLM on the positive counts;
# a likelihood-based zero-truncated fit is sketched earlier via VGAM::pospoisson())
ztp_model <-glm(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = subset(data, numberofdeath > 0), family = poisson())
# Summarize results
summary(ztp_model)
##
## Call:
## glm(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, family = poisson(), data = subset(data, numberofdeath >
## 0))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.95934 -0.35266 -0.09722 0.25453 1.24787
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8124986 0.2345634 -3.464 0.000532 ***
## childsex2 0.0079055 0.0626173 0.126 0.899533
## childtwin2 -0.0762963 0.1921469 -0.397 0.691314
## childreneverborn 0.0648403 0.0129062 5.024 5.06e-07 ***
## watersource2 0.0457141 0.0801554 0.570 0.568462
## toiletfacility2 -0.3505887 0.1004278 -3.491 0.000481 ***
## wealthindex2 0.4110245 0.1473896 2.789 0.005292 **
## wealthindex3 0.1046121 0.1453998 0.719 0.471846
## wealthindex4 0.1605622 0.1363616 1.177 0.239007
## wealthindex5 0.0516197 0.1422557 0.363 0.716705
## maritalstatus2 0.0703858 0.1164930 0.604 0.545707
## maritalstatus3 0.0135826 0.1521566 0.089 0.928870
## birthorder -0.0005783 0.0118074 -0.049 0.960939
## breastfeeding2 0.1732129 0.1015624 1.705 0.088105 .
## mothersageatfirstbirth 0.0290687 0.0076072 3.821 0.000133 ***
## Residence2 -0.2246540 0.1104497 -2.034 0.041952 *
## Residence3 0.1017992 0.1011931 1.006 0.314421
## Education2 0.1318192 0.1134805 1.162 0.245397
## Education3 0.1691714 0.2397268 0.706 0.480385
## Education4 0.0123514 0.4659500 0.027 0.978852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 251.07 on 693 degrees of freedom
## Residual deviance: 136.12 on 674 degrees of freedom
## AIC: 1775.4
##
## Number of Fisher Scoring iterations: 4
# Fit hurdle poisson model
hurdlep_model<-hurdle(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "poisson")
# Summarize results
summary(hurdlep_model)
##
## Call:
## hurdle(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## Residence, data = data, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.4669 -1.1000 0.1561 0.4921 2.8689
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.358e+00 4.528e-01 -7.416 1.21e-13 ***
## childsex2 5.437e-02 9.745e-02 0.558 0.57690
## childtwin2 -1.839e-01 2.788e-01 -0.660 0.50945
## childreneverborn 1.583e-01 2.176e-02 7.275 3.47e-13 ***
## watersource2 4.877e-02 1.537e-01 0.317 0.75096
## toiletfacility2 -9.117e-01 1.917e-01 -4.755 1.99e-06 ***
## wealthindex2 8.911e-01 2.751e-01 3.240 0.00120 **
## wealthindex3 1.108e-01 2.720e-01 0.407 0.68373
## wealthindex4 5.008e-01 2.623e-01 1.909 0.05624 .
## wealthindex5 6.011e-02 2.768e-01 0.217 0.82806
## maritalstatus2 2.917e-01 1.935e-01 1.507 0.13172
## maritalstatus3 1.285e-01 2.534e-01 0.507 0.61206
## birthorder -3.151e-03 1.649e-02 -0.191 0.84847
## breastfeeding2 7.773e-01 2.373e-01 3.276 0.00105 **
## mothersageatfirstbirth 5.642e-02 1.274e-02 4.429 9.46e-06 ***
## Residence2 -5.274e-01 1.973e-01 -2.673 0.00753 **
## Residence3 4.022e-01 1.984e-01 2.027 0.04264 *
## Education2 1.120e-01 1.851e-01 0.605 0.54503
## Education3 5.508e-01 3.641e-01 1.513 0.13040
## Education4 -1.382e+01 1.182e+03 -0.012 0.99067
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80922 0.14911 5.427 5.73e-08 ***
## Residence2 0.07808 0.20142 0.388 0.698
## Residence3 -0.05024 0.17584 -0.286 0.775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 43
## Log-likelihood: -1185 on 23 Df
# Fit ZINB model
zinb_model <-zeroinfl(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "negbin")
## Warning in value[[3L]](cond): system is computationally singular: reciprocal
## condition number = 5.40204e-37FALSE
# Summarize results (the NA standard errors below reflect the singular
# Hessian warned about above: the zero-inflation component is degenerate,
# consistent with the data showing no excess zeros)
summary(zinb_model)
##
## Call:
## zeroinfl(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## Residence, data = data, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.37751 -0.70108 0.02744 0.52144 1.91058
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.338e+00 NA NA NA
## childsex2 -9.819e-02 NA NA NA
## childtwin2 7.334e-03 NA NA NA
## childreneverborn 1.030e-01 NA NA NA
## watersource2 2.566e-01 NA NA NA
## toiletfacility2 -3.069e-01 NA NA NA
## wealthindex2 6.405e-01 NA NA NA
## wealthindex3 4.012e-01 NA NA NA
## wealthindex4 5.634e-01 NA NA NA
## wealthindex5 5.010e-01 NA NA NA
## maritalstatus2 -1.377e-01 NA NA NA
## maritalstatus3 -3.958e-01 NA NA NA
## birthorder -9.030e-06 NA NA NA
## breastfeeding2 6.093e-01 NA NA NA
## mothersageatfirstbirth 3.417e-02 NA NA NA
## Residence2 -1.018e-01 NA NA NA
## Residence3 -1.163e-02 NA NA NA
## Education2 1.766e-02 NA NA NA
## Education3 -2.737e-01 NA NA NA
## Education4 3.973e-01 NA NA NA
## Log(theta) 1.483e+01 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -55.515 NA NA NA
## Residence2 -21.636 NA NA NA
## Residence3 -5.911 NA NA NA
##
## Theta = 2745497.5196
## Number of iterations in BFGS optimization: 50
## Log-likelihood: -1189 on 24 Df
# Fit ZTNB model (approximated here by a negative binomial GLM on the
# positive counts; see the zero-truncation discussion above)
ztnb_model <-glm.nb(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education, data = subset(data, numberofdeath > 0))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Summarize results (again, the huge theta indicates the negative binomial
# fit collapses to the Poisson)
summary(ztnb_model)
##
## Call:
## glm.nb(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus +
## birthorder + breastfeeding + mothersageatfirstbirth + Residence +
## Education, data = subset(data, numberofdeath > 0), init.theta = 130858.5048,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.95933 -0.35266 -0.09722 0.25453 1.24785
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8124955 0.2345652 -3.464 0.000533 ***
## childsex2 0.0079056 0.0626178 0.126 0.899533
## childtwin2 -0.0762969 0.1921485 -0.397 0.691314
## childreneverborn 0.0648402 0.0129063 5.024 5.06e-07 ***
## watersource2 0.0457144 0.0801560 0.570 0.568463
## toiletfacility2 -0.3505888 0.1004286 -3.491 0.000481 ***
## wealthindex2 0.4110241 0.1473907 2.789 0.005292 **
## wealthindex3 0.1046129 0.1454008 0.719 0.471845
## wealthindex4 0.1605624 0.1363625 1.177 0.239009
## wealthindex5 0.0516200 0.1422566 0.363 0.716705
## maritalstatus2 0.0703858 0.1164938 0.604 0.545709
## maritalstatus3 0.0135823 0.1521576 0.089 0.928871
## birthorder -0.0005783 0.0118075 -0.049 0.960939
## breastfeeding2 0.1732129 0.1015631 1.705 0.088107 .
## mothersageatfirstbirth 0.0290686 0.0076073 3.821 0.000133 ***
## Residence2 -0.2246540 0.1104506 -2.034 0.041954 *
## Residence3 0.1017986 0.1011939 1.006 0.314427
## Education2 0.1318193 0.1134814 1.162 0.245401
## Education3 0.1691715 0.2397285 0.706 0.480387
## Education4 0.0123513 0.4659520 0.027 0.978852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(130858.5) family taken to be 1)
##
## Null deviance: 251.07 on 693 degrees of freedom
## Residual deviance: 136.12 on 674 degrees of freedom
## AIC: 1777.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 130859
## Std. Err.: 887717
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -1735.423
# Fit hurdle negative binomial model
hurdlenb_model<-hurdle(numberofdeath~childsex+childtwin+childreneverborn+watersource+toiletfacility+wealthindex+maritalstatus+birthorder+breastfeeding+mothersageatfirstbirth+Residence+Education|Residence, data = data, dist = "negbin")
## Warning in sqrt(diag(vc_count)[kx + 1]): NaNs produced
# Summarize results
summary(hurdlenb_model)
##
## Call:
## hurdle(formula = numberofdeath ~ childsex + childtwin + childreneverborn +
## watersource + toiletfacility + wealthindex + maritalstatus + birthorder +
## breastfeeding + mothersageatfirstbirth + Residence + Education |
## Residence, data = data, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.4669 -1.1000 0.1561 0.4922 2.8687
##
## Count model coefficients (truncated negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.35809 0.45282 -7.416 1.21e-13 ***
## childsex2 0.05446 0.09745 0.559 0.57629
## childtwin2 -0.18384 0.27875 -0.660 0.50957
## childreneverborn 0.15828 0.02176 7.274 3.48e-13 ***
## watersource2 0.04881 0.15368 0.318 0.75078
## toiletfacility2 -0.91178 0.19173 -4.755 1.98e-06 ***
## wealthindex2 0.89107 0.27505 3.240 0.00120 **
## wealthindex3 0.11082 0.27204 0.407 0.68373
## wealthindex4 0.50077 0.26230 1.909 0.05624 .
## wealthindex5 0.06013 0.27677 0.217 0.82801
## maritalstatus2 0.29167 0.19351 1.507 0.13175
## maritalstatus3 0.12853 0.25340 0.507 0.61198
## birthorder -0.00315 0.01649 -0.191 0.84854
## breastfeeding2 0.77731 0.23729 3.276 0.00105 **
## mothersageatfirstbirth 0.05641 0.01274 4.429 9.46e-06 ***
## Residence2 -0.52740 0.19732 -2.673 0.00752 **
## Residence3 0.40220 0.19841 2.027 0.04265 *
## Education2 0.11206 0.18509 0.605 0.54487
## Education3 0.55082 0.36414 1.513 0.13037
## Education4 -12.73724 1.26628 -10.059 < 2e-16 ***
## Log(theta) 14.07542 NaN NaN NaN
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80922 0.14911 5.427 5.73e-08 ***
## Residence2 0.07808 0.20142 0.388 0.698
## Residence3 -0.05024 0.17584 -0.286 0.775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta: count = 1296814.9889
## Number of iterations in BFGS optimization: 44
## Log-likelihood: -1185 on 24 Df
In summary, Poisson, Negative Binomial, Geometric, Quasi-Poisson, ZIP, ZINB, ZTP, ZTNB, hurdle Poisson, and hurdle negative binomial regression models are all useful for analyzing count data, but they differ in their assumptions and approach. The choice of model depends on the nature of the data and the specific research question being addressed.
# AIC
aic<-AIC(poisson_model,nb_model,zip_model,zinb_model,hurdlep_model,hurdlenb_model)
bic<-BIC(poisson_model,nb_model,zip_model,zinb_model,hurdlep_model,hurdlenb_model)
aic
## df AIC
## poisson_model 20 2417.383
## nb_model 21 2419.401
## zip_model 21 2419.383
## zinb_model 24 2425.383
## hurdlep_model 23 2415.942
## hurdlenb_model 24 2417.942
bic
## df BIC
## poisson_model 20 2515.637
## nb_model 21 2522.569
## zip_model 21 2522.550
## zinb_model 24 2543.289
## hurdlep_model 23 2528.935
## hurdlenb_model 24 2535.848
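Beyond AIC and BIC, models with and without a zero-inflation component can also be compared directly; pscl's Vuong test for non-nested models is one option. A sketch (not run above):
# Vuong test: a significantly positive statistic favors the first model
vuong(zip_model, poisson_model)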
The results above show the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values for the count regression models fitted to the data. Both criteria measure the goodness of fit of a statistical model while penalizing model complexity; lower AIC and BIC values indicate a better fit of the model to the data.
In this study, six count regression models were fitted to the data: Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), hurdle Poisson, and hurdle negative binomial.
The results show that the hurdle Poisson regression model had the lowest AIC (2415.942), making it the best-fitting model by that criterion, while the plain Poisson model had the lowest BIC (2515.637), reflecting BIC’s heavier penalty on additional parameters. The hurdle Poisson model is a two-part model that accounts for excess zeros while assuming equidispersion in the positive counts, making it suitable for count data with excess zeros but no overdispersion.
By AIC, the plain Poisson model ranked second (2417.383), closely followed by the hurdle negative binomial (2417.942), ZIP (2419.383), and NB (2419.401) models; the ZINB model had the highest AIC (2425.383).
Overall, the results suggest that the hurdle Poisson model provides the best fit by AIC, and it is used below to identify the risk factors associated with under-five mortality in Togdheer.
The results presented are from the hurdle Poisson regression model, a two-part model that accounts for excess zeros while assuming equidispersion in the positive counts. The model was used to investigate the risk factors associated with under-five mortality in Togdheer, using a set of independent variables.
The results are presented in two parts. The first part shows the count model coefficients, which represent the association between the independent variables and the count of deaths among children under the age of five years. The intercept term represents the log of the expected count of deaths when all independent variables are at their reference levels (or at zero for continuous variables).
The results show that childreneverborn, toiletfacility, wealthindex, breastfeeding, mothersageatfirstbirth, and Residence were significantly associated with the count of deaths among children under the age of five years. In the count part of the model, the expected number of deaths increased with the number of children ever born and with the mother’s age at first birth, and differed by breastfeeding category, while the second toilet-facility and Residence categories were associated with lower expected counts.
Other variables, such as childsex, childtwin, watersource, maritalstatus, birthorder, and Education, were not significantly associated with the count of under-five deaths in this study; the wealth index was significant only for some of its categories.
The second part of the results shows the zero hurdle model coefficients, which model the probability of crossing the hurdle, that is, of observing at least one death. Only Residence was included in this component; its categories were not significant, while the intercept was significantly different from zero.
The model’s log-likelihood was −1185 on 23 degrees of freedom. The Pearson residuals are centered near zero with no extreme values, suggesting no gross lack of fit.
Overall, the results suggest that factors related to maternal and child health, such as childreneverborn, toilet facility, wealth, breastfeeding, residence, and mothersageatfirstbirth, are important predictors of under-five mortality in Togdheer. Policymakers and stakeholders should focus on implementing interventions to improve maternal and child health outcomes to reduce under-five mortality in the region.
For more information about drafting a research paper using count regression models, we refer readers to Argawu et al. (2022).
Negative binomial regression: Negative binomial regression becomes an alternative to Poisson regression when there is overdispersion in the data, which means that the variance is greater than the mean.
Quasi-Poisson regression: Quasi-Poisson regression becomes an alternative to Poisson regression when there is overdispersion in the data, but the negative binomial distribution is not appropriate, or when there is underdispersion in the data.
Geometric regression: Geometric regression becomes an alternative to Poisson regression when the dependent variable follows a geometric distribution.
Zero-inflated Poisson regression: Zero-inflated Poisson regression becomes an alternative to Poisson regression when the data contains excess zeros, which means that there are more zeros in the data than would be expected from a Poisson distribution.
Zero-inflated negative binomial regression: Zero-inflated negative binomial regression becomes an alternative to Poisson regression when the data contains excess zeros and overdispersion.
Zero-truncated Poisson regression: Zero-truncated Poisson regression becomes an alternative to Poisson regression when the data contain only non-zero counts and the counts are equidispersed (variance approximately equal to the mean).
Zero-truncated negative binomial regression: Zero-truncated negative binomial regression becomes an alternative to Poisson regression when the data contains non-zero counts, and the variance is greater than the mean.
Hurdle Poisson regression: Hurdle Poisson regression becomes an alternative to Poisson regression when the data contain excess zeros and the positive counts show no overdispersion.
Hurdle negative binomial regression: Hurdle negative binomial regression becomes an alternative to Poisson regression when the data contain excess zeros and overdispersion.
Overall, these different regression models become alternatives to Poisson regression in various situations where the assumptions of Poisson regression are not met or where the data has certain characteristics such as excess zeros, overdispersion, or only non-zero counts. Understanding the appropriate regression model to use for a given data set is important for obtaining accurate and reliable predictions and insights.
Regression models with an ordinal response variable are commonly used in situations where the outcome of interest is an ordered categorical variable with three or more levels. In this type of model, the outcome variable is modeled as a function of one or more predictor variables. Here are five regression models commonly used with an ordinal response variable, along with their definition, application, assumptions, real-life examples, and R code:
Definition: Ordinal logistic regression is a statistical model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the odds of being in a higher category of the outcome variable are modeled as a function of the predictor variables.
Application: Ordinal logistic regression can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.
Assumptions: The assumptions of ordinal logistic regression include the proportional odds assumption, which states that the effect of each predictor variable on the odds of being in a higher category is the same across all thresholds of the outcome variable.
Real-life example: A researcher wants to investigate the relationship between the level of education and the severity of depression in a sample of patients. The outcome variable is the severity of depression, which ranges from mild to severe, and the predictor variable is the level of education, which is categorized as high school, college, or graduate school.
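The proportional odds assumption can be checked empirically; one common option is the Brant test. Below is a minimal sketch, assuming the brant package is installed and using an illustrative three-level outcome built from MASS's Cars93 data:
# Brant test sketch for the proportional odds assumption
library(MASS)
library(brant)
data(Cars93)
Cars93$PriceCat <- cut(Cars93$Price, breaks = c(0, 15, 30, Inf),
                       labels = c("low", "mid", "high"), ordered_result = TRUE)
po_fit <- polr(PriceCat ~ MPG.city + EngineSize, data = Cars93, Hess = TRUE)
brant(po_fit)  # small p-values flag predictors that violate proportional odds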
Definition: Ordered probit regression is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the probability of being in each category of the outcome variable is modeled as a function of the predictor variables using a probit link function.
Application: Ordered probit regression can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.
Assumptions: The assumptions of ordered probit regression include a normally distributed latent error term and, as in the proportional odds model, predictor effects that are constant across the thresholds of the outcome variable.
Real-life example: A researcher wants to investigate the relationship between the level of physical activity and the level of depression in a sample of adults. The outcome variable is the level of depression, which is categorized as mild, moderate, or severe, and the predictor variable is the level of physical activity, which is categorized as low, moderate, or high.
Definition: The cumulative logit model is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the cumulative probabilities of being in each category of the outcome variable are modeled as a function of the predictor variables.
Application: Cumulative logit models can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable, such as the severity of a disease, level of education, or customer satisfaction.
Assumptions: The assumptions of the cumulative logit model include the proportional odds assumption, which states that the effect of each predictor variable on the odds of being in a higher category is the same across all thresholds of the outcome variable.
Real-life example: A researcher wants to investigate the relationship between the level of education and the level of income in a sample of workers. The outcome variable is the level of income, which is categorized as low, medium, or high, and the predictor variable is the level of education, which is categorized as high school, college, or graduate school.
Definition: Proportional odds model is a type of ordinal logistic regression model where the probability of a response variable taking a particular value or lower is modeled as a function of the predictor variables. It is used when the response variable is ordered or ordinal, and the categories have a natural ordering.
Application: The proportional odds model can be applied to a wide range of research areas, including social sciences, medical research, and marketing. It can be used to investigate the relationship between predictor variables and a range of ordinal outcomes, such as Likert scale responses, disease stages, or educational levels.
Assumptions: The proportional odds model assumes that the odds ratio for each predictor variable is constant across all thresholds of the response variable. It also assumes that the relationship between the predictor variables and the response variable is linear on the log-odds scale.
Real-life example: Suppose a researcher wants to investigate the relationship between age, gender, and income on the level of education attained. The response variable is ordered, with categories of “less than high school,” “high school,” “some college,” and “college degree or higher.” The proportional odds model can be used to model the probability of attaining a certain level of education based on age, gender, and income.
Definition: The partial proportional odds model is a regression model used to analyze the relationship between one or more predictor variables and an ordinal outcome variable with three or more levels. In this model, the proportional odds assumption is relaxed for one or more of the predictor variables.
Application: Partial proportional odds models can be used in various fields, such as healthcare, social sciences, and marketing, to analyze the relationship between predictor variables and an ordinal outcome variable when the proportional odds assumption is violated.
Assumptions: The partial proportional odds model assumes that the proportional odds assumption holds for some of the predictor variables while it is relaxed for the others.
Real-life example: A researcher wants to investigate the relationship between the level of social support and the level of depression in a sample of adults. The outcome variable is the level of depression, which is categorized as mild, moderate, or severe, and the predictor variable is the level of social support, which is categorized as low, moderate, or high. However, the proportional odds assumption does not hold for the level of social support variable.
R Code for All
# Load the package and data
library(MASS)
library(ordinal)
## Warning: package 'ordinal' was built under R version 4.1.3
##
## Attaching package: 'ordinal'
## The following objects are masked from 'package:actuar':
##
## dgumbel, dlgamma, pgumbel, plgamma, qgumbel, rgumbel
## The following object is masked from 'package:dplyr':
##
## slice
data(Cars93)
# Create an ordered response variable
Cars93$Price <- cut(Cars93$Price, breaks = c(0, 15, 30, 45, 60, 75, 90, Inf), labels = c("0-15", "15-30", "30-45", "45-60", "60-75", "75-90", ">90"))
Cars93$Price <- ordered(Cars93$Price)
# Proportional odds model
model_proportional_odds <- polr(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
summary(model_proportional_odds)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
##
## Coefficients:
## Value Std. Error t value
## MPG.city -0.7621 0.2465 -3.0910
## MPG.highway 0.2014 0.1670 1.2061
## EngineSize 0.2833 0.3743 0.7568
##
## Intercepts:
## Value Std. Error t value
## 0-15|15-30 -11.1564 3.4265 -3.2559
## 15-30|30-45 -6.5208 3.1823 -2.0491
## 30-45|45-60 -4.2840 3.2662 -1.3116
## 45-60|60-75 -3.5746 3.3428 -1.0694
##
## Residual Deviance: 119.4952
## AIC: 133.4952
# Partial proportional odds model: note that polr() always imposes
# proportional odds (method = "logistic" is its default), so this call refits
# the same model as above; a genuine partial proportional odds fit is
# sketched after this output
model_partial_proportional_odds <- polr(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, method = "logistic")
summary(model_partial_proportional_odds)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93,
## method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## MPG.city -0.7621 0.2465 -3.0910
## MPG.highway 0.2014 0.1670 1.2061
## EngineSize 0.2833 0.3743 0.7568
##
## Intercepts:
## Value Std. Error t value
## 0-15|15-30 -11.1564 3.4265 -3.2559
## 15-30|30-45 -6.5208 3.1823 -2.0491
## 30-45|45-60 -4.2840 3.2662 -1.3116
## 45-60|60-75 -3.5746 3.3428 -1.0694
##
## Residual Deviance: 119.4952
## AIC: 133.4952
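As noted above, polr() cannot relax proportional odds, so the previous call reproduces the proportional odds fit. A genuine partial proportional odds model can be sketched with ordinal::clm, whose nominal argument frees the thresholds for chosen predictors (relaxing EngineSize here is an illustrative choice, not part of the original analysis):
# Partial proportional odds sketch: proportional odds relaxed for EngineSize only
model_ppo <- clm(Price ~ MPG.city + MPG.highway, nominal = ~ EngineSize, data = Cars93)
summary(model_ppo)
# Likelihood ratio test against the ordinary cumulative logit fit (fitted below):
# anova(model_cumulative_logit, model_ppo)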
# Ordinal logistic regression model
model_ordinal_logistic <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "logit")
summary(model_ordinal_logistic)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data: Cars93
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 93 -59.75 133.50 7(0) 3.04e-10 8.4e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MPG.city -0.7621 0.2465 -3.091 0.00199 **
## MPG.highway 0.2014 0.1670 1.206 0.22777
## EngineSize 0.2833 0.3743 0.757 0.44919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0-15|15-30 -11.157 3.427 -3.256
## 15-30|30-45 -6.521 3.182 -2.049
## 30-45|45-60 -4.284 3.266 -1.312
## 45-60|60-75 -3.575 3.343 -1.069
# Ordinal probit regression model
model_ordinal_probit <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93, link = "probit")
summary(model_ordinal_probit)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data: Cars93
##
## link threshold nobs logLik AIC niter max.grad cond.H
## probit flexible 93 -59.93 133.85 7(1) 5.82e-09 7.8e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MPG.city -0.41922 0.13153 -3.187 0.00144 **
## MPG.highway 0.09881 0.09237 1.070 0.28476
## EngineSize 0.14504 0.19981 0.726 0.46791
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0-15|15-30 -6.529 1.859 -3.512
## 15-30|30-45 -3.900 1.739 -2.243
## 30-45|45-60 -2.769 1.764 -1.570
## 45-60|60-75 -2.494 1.779 -1.402
# Cumulative logit model
model_cumulative_logit <- clm(Price ~ MPG.city + MPG.highway + EngineSize, data = Cars93)
summary(model_cumulative_logit)
## formula: Price ~ MPG.city + MPG.highway + EngineSize
## data: Cars93
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 93 -59.75 133.50 7(0) 3.04e-10 8.4e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MPG.city -0.7621 0.2465 -3.091 0.00199 **
## MPG.highway 0.2014 0.1670 1.206 0.22777
## EngineSize 0.2833 0.3743 0.757 0.44919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0-15|15-30 -11.157 3.427 -3.256
## 15-30|30-45 -6.521 3.182 -2.049
## 30-45|45-60 -4.284 3.266 -1.312
## 45-60|60-75 -3.575 3.343 -1.069
aic<-AIC(model_ordinal_logistic,model_ordinal_probit,model_cumulative_logit,model_proportional_odds,model_partial_proportional_odds)
aic
## df AIC
## model_ordinal_logistic 7 133.4952
## model_ordinal_probit 7 133.8548
## model_cumulative_logit 7 133.4952
## model_proportional_odds 7 133.4952
## model_partial_proportional_odds 7 133.4952
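Note that the proportional odds, partial proportional odds (as specified with polr above), ordinal logistic, and cumulative logit fits are the same logit-link model under different names, which is why their AIC values coincide; only the probit fit differs.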
Thanks for your attention