library(tidyverse)
library(lme4)
library(nlme)
library(ggpubr)
library(kableExtra)
library(MASS)

1. Dataset

df = read.csv("Nacimientos_Hospital_General_de_Medell_n_enero_a_agosto_de_2021.csv",encoding = "UTF-8")

names(df)[1] = "Genero"

df = df[df$Genero != "INDETERMINADO",c("Genero","Peso.en.gramos","Tempo.gestación",
                                       "Tipo.parto","Nombre.administradora","Número.de.embarazos")]

df$Nombre.administradora[df$Nombre.administradora == ""] = "OTRO"

For the development of this application we will be using data related to the birth of children in the city of Medellin from January to August of the year 2021 extracted a the government site in the following link [https://www.datos.gov.co/Seguridad-y-Defensa/Nacimientos-Hospital-General-de-Medell-n-enero-a-a/w4jn-n4vq]. Said dataset contains 30 variables for 2702 different observations regarding information about the newborn as well as the parents. However, only a few of these are of our interest from which we can highlight the next few:

  • Genero: Gender of the newborn.

  • Peso.en.gramos: Weight of the newborn in grams (this is the response variable).

  • Tempo.gestacion: Gestation time.

  • Tipo.parto: Type of birth (Spontaneous, instrumented or Cesarean section).

  • Nombre.administradora: health company that delivered the baby.

  • Número.de.embarazos: Number of pregnancies the mother has gone through.

kbl(head(df)) %>% 
  kable_styling(latex_options = "HOLD_position")
Genero Peso.en.gramos Tempo.gestación Tipo.parto Nombre.administradora Número.de.embarazos
FEMENINO 2877 41 ESPONTÁNEO COOMEVA E.P.S. S.A.-CM 2
MASCULINO 2870 38 ESPONTÁNEO SAVIA SALUD E.P.S. 4
MASCULINO 2897 35 ESPONTÁNEO SAVIA SALUD E.P.S. -CM 4
MASCULINO 2265 37 ESPONTÁNEO SAVIA SALUD E.P.S. 1
FEMENINO 2180 36 ESPONTÁNEO SAVIA SALUD E.P.S. 1
MASCULINO 3552 38 CESÁREA SAVIA SALUD E.P.S. 2

2. Objective

In every society the creation of new life is one of the most, if not the most, important possible events. For thath reason, it has always been mandatory to give the best care and comfort to the new coming child; one of the biggest concerns throughout time has been the weight of the baby, usually babies cannot really be overweight as a newborn; however, a very low birth weight can represent a number of different obstacles for th babies healthy growth. Based on this, we are interested on building a linear model that can, one, help us understand what defines the weight of a baby and, two, provides accurate predictions of the baby’s weight even before the actual birth. All of this for the city of Medellin and maybe Colombia.

This would be of immense value since knowing the approximate weight of the baby beforehand can facilitate its subsequent special care if needed.

3. Descriptive analysis

Before the construction of the model, it is very important to check for the relationship between the response variables and each one of the predictive variables we previously mentioned. This way we can find out whether said variables are really important to predict the weight and also find the real structure of effects for those variables (maybe some are related in a quadratic form).

3.1 Gender

Generally speaking, male humans tend to be bigger a heavier than their female counterparts. However, it is interesting to test whether this trend is also visible for newborn babies or not.

ggplot(df, aes(x = Genero, y = Peso.en.gramos,col = Genero))+
  geom_boxplot()+
  theme_minimal(9)+
  scale_color_manual(values = c("#FF91CA","#75D5F9"))+
  theme(legend.position = "none")+
  labs(title = "Newborn's weight by gender")

Visually, it seems like male babies tend to have a slightly higher weight than female babies. However, it does not look like this difference is significant, which means that it is reasonable to conclude that there is no real difference in the weight for the both genders.

3.2 Gestation time

It might be logical to think that babies who did not spend the necessary time in their mother’s womb did not get to fully develop which would mean that their weight would tend to be lower.

ggplot(df, aes(x = Tempo.gestación, y = Peso.en.gramos))+
  geom_point()+
  geom_smooth(fill = "lightblue",col = "#72ADE1")+
  theme_minimal(9)+
  labs(title = "Newborn's weight Vs Gestation time")

As we expected, there seems to be a positive relation between this two variables, where the longer the baby spends inside the womb, the heavier it will be until around 40 weeks where it seems to stabilize at around 3500 grams, in average.

It is also important to mention that this relation, for the most part, seems to be linear.

3.3 Birth type

It might be possible that being born a certain represents an increase in the average weight of a newborn baby.

ggplot(df, aes(x = Tipo.parto, y = Peso.en.gramos,col = Tipo.parto))+
  geom_boxplot()+
  theme_minimal(9)+
  scale_color_manual(values = c("#EA2626","#5DDCE1","#BF5DE1"))+
  theme(legend.position = "none")+
  labs(title = "Newborn's weight by birth type")

Graphically, we see how, in general, babies born trough Cesarean section have a lighter weight than any other birth type, specially the instrumented one which seems to be the one resulting in the higher weights in average.

3.4 Number of previous pregnancies

We are now trying to see whether having been pregnant multiple times before represents and increment or a decrease in the average weight for a next newborn baby.

ggplot(df, aes(x = Número.de.embarazos, y = Peso.en.gramos))+
  geom_point()+
  geom_smooth(fill = "lightblue",col = "#72ADE1")+
  theme_minimal(9)+
  labs(title = "Newborn's weight Vs Number of previous pregnancies")

It looks like there is no real relation between these two variables. Which means having been pregnant multiple times does not really seems to be influencing the final weight of the next baby.

We already mentioned how we desire to build linear models in order to accurately predict and describe the variable regarding the weight of a newborn baby in the city of Medellin.

Specifically, we will be working with 3 different candidate models for this purpose.

4. Models

Model 1

The first model will be the “base” model. It will include all the variables that we already considered to be important (Except for the hometown) in an ordinary least squares regression that only includes the principal effects of each of these variables. Let’s remember that we have some categorical variables that we will have to turn into indicative variables that take the value of one if a certain condition is met and zero in any other case. For this model, all effects are fixed and not random:

\[y_{j} \sim N(\mu_j,\sigma^2)\]

\[\mu_j = \beta_0 + \sum_{k = 1}^5 \beta_kX_{kj}\] For this model, we are assuming that the error is independent and follows a normal distribution with constant variance.

Model 2

For the second model, we want to include the possible effect that the health company that delivers the baby has over the weight of said baby. Maybe some companies can provide better services for the care of the baby, even before it is born (for example in prenatal appointments). Since there are more than 40 different health companies and we are not really interested in the effect that some specific companies may have on the weight of the baby, it is more than reasonable to include said variable as a random effect (random intercept).

\[y_{ij} \sim N(\mu_{ij},\sigma^2)\] \[\mu_{ij} = \beta_0 + \sum_{k = 1}^5 \beta_kX_{kij} + b_{0i}\] \[b_0 \sim N(0,\sigma^2_{b0})\] Similar to the first model, here we are also assuming normally distributed errors with constant variance. But now we are also assuming that our random intercept is normally distributed with a constant variance.

Model 3

For the third model, we believe it might be interesting to include a random effect for the gestation time of the baby. This will result in a model with a random slope as such:

\[y_{ij} \sim N(\mu_{ij}, \sigma^2)\] \[\mu_{ij} = \beta_0 + \sum_{k = 1}^5 \beta_kX_{kij} + b_{0i} + b_{1i}X_{2ij}\]

\[\begin{pmatrix} b_0\\ b_1 \end{pmatrix} \sim N \left(\begin{pmatrix} 0\\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_{b0} & \sigma_{b01}\\ \sigma_{b01} & \sigma^2_{b1} \end{pmatrix}\right)\]

Once again, we assume that the error is independent and follows a normal distribution with mean zero and constant variance. Moreover, it is important to mention that \(X_{2}\) makes reference to the gestation time variable.

5. Fitting the models

Now that we know which models we will be using, let us proceed with the fitting of each one of these models as shown next:

mod1 = gls(Peso.en.gramos~.-Nombre.administradora, data = df)

mod2 = lme(Peso.en.gramos~Genero+Tempo.gestación+Tipo.parto+Número.de.embarazos
           ,random = ~ 1 | Nombre.administradora,data = df)

mod3 = lme(Peso.en.gramos~Genero+Tempo.gestación+Tipo.parto+Número.de.embarazos
           ,random = ~ 1 + Número.de.embarazos | Nombre.administradora,data = df)

6. Model comparison

In this section, we are interested on comparing how good each one of our model really is. To do this, we will be performing a number of different tests in order to check whether the components of the model really are significant or not.

6.1 Hypothesis testing for the variance components (Random intercept)

Firstly, we want to check the significance of the random intercept that was included in the second model, which can be described as:

\[H_0: \sigma^2_{b0} = 0 \ Vs \ H_1: \sigma^2_{b0 > 0}\]

For the given hypothesis, we have the corresponding statistic as follows:

\[lrt = -2\times(logLik(model2)-logLik(model1)) \sim \chi^2_1\] As we usually do, we will reject the null hypothesis if the p-value (\(P(\chi^2_1 > lrt)\)) is small enough. Specifically, if it is smaller than a value of \(\alpha = 0.05\).

The resulting table for this test is presented next:

lrt = -2*(logLik(mod1)-logLik(mod2))

pvalue = pchisq(lrt,1,lower.tail = F)

resultsVar = data.frame(Statistic = lrt[1], P_value = pvalue[1])

kbl(resultsVar) %>% 
  kable_styling(latex_options = "HOLD_position")
Statistic P_value
10.03716 0.0015341

The resulting p-value for the test of interest resulted to be very small, in fact smaller than a significance level of \(\alpha = 0.05\). Because of this, we can confidently conclude that the inclusion of a random intercept does represent an improvement for the initial model. Which means it is a good idea to maintain this random intercept in the final model.

6.2 Hypothesis testing for the variance components (Random slope)

Now that we concluded that the random intercept is significant and should be kept in the model. We want to test whether the random slope is also significant or should just be dropped from the final model. This means we have to prove an hypothesis of the next form:

\[H_0: \begin{pmatrix} \sigma^2_{b0} & \sigma_{b01}\\ \sigma_{b01} & \sigma^2_{b1} \end{pmatrix} = \begin{pmatrix} \sigma^2_{b0} & 0\\ 0 & 0 \end{pmatrix} \ Vs \ H_1:\begin{pmatrix} \sigma^2_{b0} & \sigma_{b01}\\ \sigma_{b01} & \sigma^2_{b1} \end{pmatrix} = \begin{pmatrix} \sigma^2_{b0} & \sigma_{b01} \neq 0\\ \sigma_{b01} \neq 0 & \sigma^2_{b1} > 0 \end{pmatrix} \]

The statistic for this test is equivalent to the one we saw for the previous test but in this case using both the model with just the random intercept and the model that also includes the random slope, which looks like this:

\[lrt = -2\times(logLik(model2)-logLik(model3)) \sim \chi^2_1\]

As we usually do, we will reject the null hypothesis if the p-value (\(P(\chi^2_1 > lrt)\)) is small enough. Specifically, if it is smaller than a value of \(\alpha = 0.05\).

The resulting table for this second test is presented next:

lrt = -2*(logLik(mod2)-logLik(mod3))

pvalue = pchisq(lrt,1,lower.tail = F)

resultsVar = data.frame(Statistic = lrt[1], P_value = pvalue[1])

kbl(resultsVar) %>% 
  kable_styling(latex_options = "HOLD_position")
Statistic P_value
2.879189 0.0897312

For this case, we get a small p-value. However, it is not smaller than the significance level that we selected before for these tests. For that reason, we assume that the variance component for the random slope is equal to zero which means that the random slope should not be included in the final model.

Finally, based on these comparisons, it is reasonable to select the second model, so far, as the best model to describe the weight of a newborn baby. Let’s remember this model is the one that uses all the variables of interest plus a random intercept for the health company that delivered the baby.

6.3 Wald’s test

After having come to the conclusion that including a random intercept for the heatlh company in charge is significant and ultimately results in a better model, we want to prove the significance of each one of our fixed effects in the model. This is equivalent to testing the next hypotheses for each effect:

\[H_0: \beta_k = 0\text{ Vs } H_1: \beta_k \neq 0\]

Which can be tested by using Wald’s test. This test uses the next statistic to calculate the significance of these effects:

\[T_0 = \frac{\hat{\beta}_k}{\hat{SE(\hat{\beta_k})}} \sim t_{2650}\] The null hypothesis will be rejected if: \[VP = 2\times P(t_{2650} > |T_0|)\] Is sufficiently small. In this case, if it is smaller than \(\alpha = 0.1\)

Let’s then take a look at the model’s summary that contains these t statistics as well as the p values:

smMod2 = summary(mod2)

kbl(smMod2$tTable) %>% 
  kable_styling(latex_options = "HOLD_position")
Value Std.Error DF t-value p-value
(Intercept) -4276.263524 118.447627 2650 -36.1025680 0.0000000
GeneroMASCULINO 99.890780 14.934638 2650 6.6885303 0.0000000
Tempo.gestación 187.925108 3.075809 2650 61.0977892 0.0000000
Tipo.partoESPONTÁNEO 9.248851 17.633492 2650 0.5245048 0.5999714
Tipo.partoINSTRUMENTADO 59.254546 36.099664 2650 1.6414154 0.1008299
Número.de.embarazos 34.166651 5.219276 2650 6.5462431 0.0000000

We can see how every fixed effect seems to be significant since all of their respective p-values, other than for the types of birth, are smaller than 0.1.

For the type of birth, since it is a multiclass variable, we need to use an F-statistic that will look like this and accounts for the variable as a whole:

\[F_0 = \frac{MS(Tipo.parto)}{MSE} \sim F_{2,2650}\] With the respective p-value as follows:

\[VP = P(F_{2,2650} > F_0)\]

Let us look a the ANOVA table for this model, which contains these statistics and the correspondent p-values.

kbl(anova(mod2)) %>% 
  kable_styling(latex_options = "HOLD_position")
numDF denDF F-value p-value
(Intercept) 1 2650 1.742844e+04 0.0000000
Genero 1 2650 2.222761e+01 0.0000025
Tempo.gestación 1 2650 3.774225e+03 0.0000000
Tipo.parto 2 2650 4.533614e-01 0.6355377
Número.de.embarazos 1 2650 4.285330e+01 0.0000000

First, it is important to mention that the F statistics for the rest of the fixed effects, except for the gender, are equivalent to the T statistic we previously saw but squared. Which also means that the p values will continue to be the same as in the wald’s test.

Finally, we can see how the birth type does not seem to be significant since the resulting p value is much higher than \(\alpha = 0.1\), which means this variable is not really adding value to our model. However, it is of our interest to specifically check for the effect of the birth type on the babies final weight plus, it might be possible that the variable’s effect could be covered by its interaction with a different variable, so we will maintain the birth type as part of the model.

7. Model validation (residuals analysis)

We have mentioned multiple times how we have a set of assumptions that we are making in each one of our model. For that reason, before actually presenting a final model, it is mandatory to validate these assumptions in our current model, since not fulfilling any of these can be harmful for the value of said model.

7.1 Normality

qqnorm(resid(mod2))
qqline(resid(mod2))

Visually, it seems like the residuals fit the theoretical quantiles decently well, especially for the more middle values. However, it does look like that this decent fit gets a little worse as soon as we look at the tails of the distribution. However, it does not seem like we have strong evidence to believe the errors do not follow a normal distribution.

7.2 Constant variance

par(mfrow = c(2,2))
stripchart(resid(mod2)~df$Tempo.gestación,vertical=TRUE,pch=1,cex=1.5,xlab="Gestation time") 
abline(h=0,lty=2) 

stripchart(resid(mod2)~df$Genero,vertical=TRUE,pch=1,cex=1.5,xlab="Gender") 
abline(h=0,lty=2) 

stripchart(resid(mod2)~df$Número.de.embarazos,vertical=TRUE,pch=1,cex=1.5,xlab="Number of previous pregnancies") 
abline(h=0,lty=2) 

stripchart(resid(mod2)~df$Tipo.parto,vertical=TRUE,pch=1,cex=1.5,xlab="Birth type") 
abline(h=0,lty=2) 

plot(fitted(mod2),resid(mod2),pch=1,cex=1.5,xlab="Fitted values") 
abline(h=0,lty=2) 

If we take a look at the plots for the residuals versus each one of the variables, it looks like there might be some problems with the variance. This is clear since we can see how it looks like the variance increases as the gestation time gets bigger, which might lead us to believe that the variance for the fitting errors is not constant. However, this is only due to a visual effect. This “difference” in variance might only be apparent since the majority of newborn babies have gestation times higher than 30.

8. Dealing with the detected problems.

The only major problem we ended up encountering is that the variance for the errors might not be constant.

One very simple way to deal with this is by transforming our response variable in an attempt to stabilize the already mentioned variance. For that, we will use the box-cox transformation to find and optimal value for a parameter \(\lambda\) such that the transformation \(Y^\lambda\) results in the best possible logLik value.

bc = boxcox(Peso.en.gramos~Genero+Tempo.gestación+Tipo.parto+Número.de.embarazos,data = df)

(lambda = bc$x[which.max(bc$y)])
## [1] 0.7878788

It looks like the optimal value for \(\lambda\) is 0.79. Then, this is the value we will use to transform our response variable.

mod22 = lme(Peso.en.gramos^0.79~Genero+Tempo.gestación+Tipo.parto+Número.de.embarazos
           ,random = ~ 1 | Nombre.administradora,data = df)
par(mfrow = c(2,2))
stripchart(resid(mod22)~df$Tempo.gestación,vertical=TRUE,pch=1,cex=1.5,xlab="Gestation time") 
abline(h=0,lty=2) 

stripchart(resid(mod22)~df$Genero,vertical=TRUE,pch=1,cex=1.5,xlab="Gender") 
abline(h=0,lty=2) 

stripchart(resid(mod22)~df$Número.de.embarazos,vertical=TRUE,pch=1,cex=1.5,xlab="Number of previous pregnancies") 
abline(h=0,lty=2) 

stripchart(resid(mod22)~df$Tipo.parto,vertical=TRUE,pch=1,cex=1.5,xlab="Birth type") 
abline(h=0,lty=2) 

Even though it is safe to say that the variance problem has been reduced event further by using the transformation, it is still visible that there might be some complications with this assumption.

However, it is still a good idea to include this transformation since, as we will shortly see, this new model performs a lot better than the initial one we had and also helps stabilize the variance to a certain extent.

model metric value
Regular response AIC 39855.09
Regular response BIC 39902.29
Regular response logLik -19919.55
Transformed response AIC 29570.10
Transformed response BIC 29617.30
Transformed response logLik -14777.05

We can then see how the initial model with the original variable has higher AIC and BIC and a lower logLik in comparison to the model that uses the transformed response variable. Which means this last model performs slightly better than the one using no transformation.

9. Final model.

Finally, it is clear that, even though we might still have some slight problems with the variance of our errors; the inclusion of the box-cox transformation represents an amazing improvement for the model.

For that reason, the final model is as shown next.

\[{\hat{\mu}}_{ij}^{0.79} = -567.1962 + 14.636\cdot Male_{ij}+29.0288\cdot Gestation.time_{ij} \\+ 2.2714\cdot Spontaneous.birth_{ij} + 9.1851\cdot Instrumented.birth_{ij} + \\ 5.001\cdot Number.of.pregnancies_{ij} + \hat{b}_0 \\\]

\[\text{With } {\hat{\mu}}_{ij} = \hat{E}[Newborn's \ weight_{ij} | X_1 = x_1,..., X_5 = x_5]\\\]

\[b_0 \ approx \sim N(0, 10.66279^2)\]

\[i = 1, 2,...,47. \ j = 1, 2,...\] Where \(i\) represents the index for the i-th health company.

References.

  • Freddy Hernández Barajas, Jorge Leonardo López Martínez (2020-03-15). Modelos Mixtos con R.

  • Pinheiro, José, Douglas Bates, and R-core. 2021. Nlme: Linear and Nonlinear Mixed Effects Models. [https://svn.r-project.org/R-packages/trunk/nlme/].

  • Pinheiro, J., and D. Bates. 2000. Mixed-Effects Models in s and s-PLUS. 1st ed. New York: Springer.

  • Scheipl, Fabian. 2020. RLRsim: Exact (Restricted) Likelihood Ratio Tests for Mixed and Additive Models