Modello Statistico per la Previsione del Peso Neonatale

The project is part of a context of growing attention to the prevention of neonatal complications. The ability to predict the birth weight of newborns represents a key opportunity to improve clinical planning and reduce the risks associated with problematic births, such as premature births or low-birth weight infants.

Loading necessary libraries

library(dplyr)
library(lmtest)
library(ggplot2)
library(MASS)
library(ggeffects)
library(moments) 
library(knitr) 
library(kableExtra)

1) Data collection and Dataset structure

I begin with an exploration of the dataset:

setwd("~/File eseguibili R/dataset")
data = read.csv("neonati.csv", sep = ',',stringsAsFactors = T)
summary(data)

##    Anni.madre     N.gravidanze       Fumatrici        Gestazione   
##  Min.   : 0.00   Min.   : 0.0000   Min.   :0.0000   Min.   :25.00  
##  1st Qu.:25.00   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:38.00  
##  Median :28.00   Median : 1.0000   Median :0.0000   Median :39.00  
##  Mean   :28.16   Mean   : 0.9812   Mean   :0.0416   Mean   :38.98  
##  3rd Qu.:32.00   3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.:40.00  
##  Max.   :46.00   Max.   :12.0000   Max.   :1.0000   Max.   :43.00  
##       Peso        Lunghezza         Cranio    Tipo.parto Ospedale   Sesso   
##  Min.   : 830   Min.   :310.0   Min.   :235   Ces: 728   osp1:816   F:1256  
##  1st Qu.:2990   1st Qu.:480.0   1st Qu.:330   Nat:1772   osp2:849   M:1244  
##  Median :3300   Median :500.0   Median :340              osp3:835           
##  Mean   :3284   Mean   :494.7   Mean   :340                                 
##  3rd Qu.:3620   3rd Qu.:510.0   3rd Qu.:350                                 
##  Max.   :4930   Max.   :565.0   Max.   :390

head(data)

##   Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto
## 1         26            0         0         42 3380       490    325        Nat
## 2         21            2         0         39 3150       490    345        Nat
## 3         34            3         0         38 3640       500    375        Nat
## 4         28            1         0         41 3690       515    365        Nat
## 5         20            0         0         38 3700       480    335        Nat
## 6         32            0         0         40 3200       495    340        Nat
##   Ospedale Sesso
## 1     osp3     M
## 2     osp1     F
## 3     osp2     M
## 4     osp2     M
## 5     osp3     F
## 6     osp2     F

str(data)

## 'data.frame':    2500 obs. of  10 variables:
##  $ Anni.madre  : int  26 21 34 28 20 32 26 25 22 23 ...
##  $ N.gravidanze: int  0 2 3 1 0 0 1 0 1 0 ...
##  $ Fumatrici   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gestazione  : int  42 39 38 41 38 40 39 40 40 41 ...
##  $ Peso        : int  3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
##  $ Lunghezza   : int  490 490 500 515 480 495 480 510 500 510 ...
##  $ Cranio      : int  325 345 375 365 335 340 345 349 335 362 ...
##  $ Tipo.parto  : Factor w/ 2 levels "Ces","Nat": 2 2 2 2 2 2 2 2 1 1 ...
##  $ Ospedale    : Factor w/ 3 levels "osp1","osp2",..: 3 1 2 2 3 2 3 1 2 2 ...
##  $ Sesso       : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...

2) Analyisis and modelling

Let’s take a look at the density and boxplot of the variables, (just the numerical variables)

attach(data)
list_features = c('Anni.madre','N.gravidanze','Gestazione','Peso','Lunghezza','Cranio')

# Set up the plotting area
par(mfrow = c(1, 2)) 

for (i in list_features){
  boxplot(data[[i]], main = i)
  
  #label generation
  x_label <- gsub("\\.", " ", tolower(i))
  if (i == 'Gestazione'){
    x_label = 'mesi di gestazione'
  }
  
  if (i == 'N.gravidanze' || i == 'Gestazione' || i == 'Anni.madre'){
    hist(data[[i]], main = i, xlab = x_label)
  }  else{
  plot(density(data[[i]]), main = i)
  }
}

Looking at the boxplot showing the age of the mother at birth, it seems clear that a couple of value cannot be realistic (age lower than 10 yo). I’ll proceed analyzing these records and later on remove these.

incorrect_values <- data[data$Anni.madre < 10,]
# Print as a formatted table
kable(incorrect_values, caption = "Records with 'Anni.madre' < 10") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Records with ‘Anni.madre’ < 10
	Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
1152	1	1	0	41	3250	490	350	Nat	osp2	F
1380	0	0	0	39	3060	490	330	Nat	osp3	M

data <- data[!(data$Anni.madre < 10 ), ]

Even though the other values recorded for these two observations seem correct, (so probably the age was misspelled), I’m going to remove them from the dataset. If Anni.madre was an important variable to determine the weight of the baby, these two records would introduce a huge bias in the estimation of the correct \(\beta\) parameter.

Now let’s take a closer look at the categorical features

# Create a table and convert it to a data frame
smoking_table <- as.data.frame(table(data$Fumatrici))

# Rename columns for clarity
colnames(smoking_table) <- c("Maternal Smoking", "Frequency")

# Print as a formatted table
kable(smoking_table, 
      caption = "Distribution of Maternal Smoking",
      col.names = c("Maternal Smoking", "Frequency")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Distribution of Maternal Smoking
Maternal Smoking	Frequency
0	2394
1	104

boxplot(data$Peso ~ data$Fumatrici, 
        main = "Boxplot of Birth Weight by Maternal Smoking",
        xlab = "Maternal Smoking (0 = Non-Smoker, 1 = Smoker)", 
        ylab = "Birth Weight (grams)",
        col = c("green", "red"),
        border = "black")

birth_type_table <- as.data.frame(table(data$Tipo.parto))
colnames(birth_type_table) <- c("Birth Type", "Frequency")
kable(birth_type_table, 
      caption = "Distribution of Birth Type",
      col.names = c("Birth Type", "Frequency")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Distribution of Birth Type
Birth Type	Frequency
Ces	728
Nat	1770

boxplot(data$Peso ~ data$Tipo.parto, 
        main = "Boxplot of Birth Weight by Delivery Type",
        xlab = "Delivery Type (Ces = Cesarean, Nat = Natural)", 
        ylab = "Birth Weight (grams)",
        col = c("lightblue", "lightpink"),
        border = "black")

# Table for Hospital (Ospedale)
hospital_table <- as.data.frame(table(data$Ospedale))
colnames(hospital_table) <- c("Hospital", "Frequency")

kable(hospital_table, caption = "Distribution of Births by Hospital") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Distribution of Births by Hospital
Hospital	Frequency
osp1	816
osp2	848
osp3	834

# Boxplot for Hospital
boxplot(data$Peso ~ data$Ospedale,
        main = "Birth Weight by Hospital",
        xlab = "Hospital",
        ylab = "Birth Weight (grams)",
        col = c("lightblue", "lightgreen", "pink"), # Add colors
        border = "black")

sex_table <- as.data.frame(table(data$Sesso))
colnames(sex_table) <- c("Sex", "Frequency")

kable(sex_table, caption = "Distribution of Births by Sex") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Distribution of Births by Sex
Sex	Frequency
F	1255
M	1243

# Boxplot for Sex
boxplot(data$Peso ~ data$Sesso,
        main = "Birth Weight by Sex",
        xlab = "Sex (F = Female, M = Male)",
        ylab = "Birth Weight (grams)",
        col = c("lightblue", "lightpink"),
        border = "black")

Looking at boxplot printed it seems that “Fumatrici” and “Sex” could be useful features to predict the weight of the baby, since the boxplots have a slightly different shape considering these distinctions.

Now in the image below we can see the graphical representation of one variable of the dataset as a function of another to understand any relation. In addition, we also have a measure of the pairwise correlations among the variables. This visualization will be useful when creating the regression model, because it gives us information on which variables should be considered in the model.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}


pairs(data[,c("Anni.madre","N.gravidanze","Gestazione","Peso","Lunghezza","Cranio")], upper.panel = panel.smooth, lower.panel = panel.cor)

1 - Let’s check whether in some hospitals we have more cesarean births than in others

Since we are considering categorical features, assuming the usual hypotesis of random sampling and uncorrelation and independence in the elements of the sample we can use a \(X^2\) test

# Relative frequency of cesarean births by hospital
birth_table <- table(data$Ospedale, data$Tipo.parto)

relative_frequencies <- prop.table(birth_table, margin = 1)[, "Ces"]

df_rf <- data.frame(
  Categoria = names(relative_frequencies),
  Frequenza = as.numeric(relative_frequencies)
)

# Create table with kable
kable(df_rf, 
      col.names = c("Hospital", "Relative Frequency"),
      digits = 3,              # numero di cifre decimali
      caption = "Relative Frequency of Hospital Births Considering only Cesarean births") %>% 
  kable_styling(full_width = FALSE)

Relative Frequency of Hospital Births Considering only Cesarean births
Hospital	Relative Frequency
osp1	0.297
osp2	0.300
osp3	0.278

# Temporary table
tab <- table(data$Ospedale, data$Tipo.parto)

# Test del chi-quadrato
chi_test <- chisq.test(tab)
chi_test

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 1.083, df = 2, p-value = 0.5819

Since we have a p-value of 0.58 we can firmly affirm that the number of Cesearan births is not significantly different from one hospital to the other, we cannot reject \(H_0\). We had also an hint of this looking at the relative frequency of ceasarean birth in each hospital, they are very close one to each other.

2 - The average weight and length of this sample of infants are significantly equal to those of the population

Looking at data available online it seems that the medium weight at birth is about 3300 g, while the length is 500mm

ospedalebambinogesu

my-personaltrainer

Since we don’t know the real distribution of these features (hence neither the real variance) we can proceed using a t_test.

mu_weight = 3300
mu_length = 500

# T Test for weight
peso_test <- t.test(data$Peso, mu = mu_weight) 
peso_test

## 
##  One Sample t-test
## 
## data:  data$Peso
## t = -1.505, df = 2497, p-value = 0.1324
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
##  3263.577 3304.791
## sample estimates:
## mean of x 
##  3284.184

# T Test for length
lunghezza_test <- t.test(data$Lunghezza, mu = mu_length)
lunghezza_test

## 
##  One Sample t-test
## 
## data:  data$Lunghezza
## t = -10.069, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
##  493.6628 495.7287
## sample estimates:
## mean of x 
##  494.6958

Looking at these results we can conclude that the weights of the babies in this sample is not significantly different from the considered true mean (if we use a classic 5 % significance level), while a different consideration has to be done for the length, since the p-value is close to 0 we can be sure that the mean length is significantly different from 500 mm

3 - Check whether the weight (or length) of the baby is significantly different considering one sex with respect to the other

Also in this case I proceed with a t-test

# Test t for weight
peso_sesso_test <- t.test(Peso ~ Sesso, data = data)
peso_sesso_test

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.115, df = 2488.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.4841 -207.3844
## sample estimates:
## mean in group F mean in group M 
##        3161.061        3408.496

# Test t for length
lunghezza_sesso_test <- t.test(Lunghezza ~ Sesso, data = data)
lunghezza_sesso_test

## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.939001  -7.882672
## sample estimates:
## mean in group F mean in group M 
##        489.7641        499.6750

In both tests we reject \(H_0\), so this means that the newborn is significantly different if we consider a male or a female (hence the sex variable will be almost certainly included in the regression model)

Creation regression model

I start checking normality in the target variable.

# Compute skewness and round to two decimal places
skewness_value <- round(moments::skewness(data$Peso), 2)
cat("Skewness:", skewness_value, "\n")

## Skewness: -0.65

# Compute excess kurtosis and round to two decimal places
kurtosis_value <- round(moments::kurtosis(data$Peso) - 3, 2)
cat("Excess Kurtosis:", kurtosis_value, "\n")

## Excess Kurtosis: 2.03

shapiro.test(data$Peso)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Peso
## W = 0.97068, p-value < 2.2e-16

As we would expect we have to refuse normality in the target variable (although the distribution has many low values, the skewness is negative overall, and the distribution is a bit peaky, giving positive kurtosis). Therefore we won’t expect normal residuals, when we’ll proceed with residuals validation.

Giving the considerations done in the previous steps, I would expect that the relevant variables to describe the weight are

“Fumatrici”
“Sex”
“Gestazione”
“Lunghezza”
“Cranio”

(the last three variables are a bit correlated one to each other, so I’ll have to check for multicollinearity and if it is valuable to use them all).

# let's start with a model with everything

mod1 = lm(Peso~., data= data)
summary(mod1)

## 
## Call:
## lm(formula = Peso ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1123.26  -181.53   -14.45   161.05  2611.89 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6735.7960   141.4790 -47.610  < 2e-16 ***
## Anni.madre        0.8018     1.1467   0.699   0.4845    
## N.gravidanze     11.3812     4.6686   2.438   0.0148 *  
## Fumatrici       -30.2741    27.5492  -1.099   0.2719    
## Gestazione       32.5773     3.8208   8.526  < 2e-16 ***
## Lunghezza        10.2922     0.3009  34.207  < 2e-16 ***
## Cranio           10.4722     0.4263  24.567  < 2e-16 ***
## Tipo.partoNat    29.6335    12.0905   2.451   0.0143 *  
## Ospedaleosp2    -11.0912    13.4471  -0.825   0.4096    
## Ospedaleosp3     28.2495    13.5054   2.092   0.0366 *  
## SessoM           77.5723    11.1865   6.934 5.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 668.7 on 10 and 2487 DF,  p-value: < 2.2e-16

As we would expect giving to the model all the variables we obtain a bit a mess.

Let’s try again removing Anni.madre and Ospedale which for sure are not a relevant factors (actually Ospedale 3 seems to differ a bit from the other two, but anyway the p value is not so low)

# removing anni.madre and Ospedale

mod2 = update(mod1,~.-Anni.madre-Ospedale)
summary(mod2)

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1129.88  -181.36   -16.24   160.63  2634.62 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6708.8092   136.0640 -49.306  < 2e-16 ***
## N.gravidanze     12.9927     4.3439   2.991  0.00281 ** 
## Fumatrici       -31.8823    27.5803  -1.156  0.24780    
## Gestazione       32.5970     3.8039   8.569  < 2e-16 ***
## Lunghezza        10.2684     0.3011  34.098  < 2e-16 ***
## Cranio           10.5015     0.4262  24.637  < 2e-16 ***
## Tipo.partoNat    30.4244    12.1041   2.514  0.01201 *  
## SessoM           78.1031    11.1998   6.974 3.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2490 degrees of freedom
## Multiple R-squared:  0.7278, Adjusted R-squared:  0.7271 
## F-statistic: 951.3 on 7 and 2490 DF,  p-value: < 2.2e-16

I tried removing other variables before Fumatrici, because I would expect it to be relevant, but it’s not (although the estimate is quite big, there is high variability, hence the standard error is very high giving a p value very high).

Instead N.Gravidanze and the type of birth seems to have effects on the description of the target variable.

# remove fumatrici

mod3 = update(mod2,~.-Fumatrici)
summary(mod3)

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1129.14  -181.97   -16.26   160.95  2638.18 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6708.0171   136.0715 -49.298  < 2e-16 ***
## N.gravidanze     12.7356     4.3385   2.935  0.00336 ** 
## Gestazione       32.3253     3.7969   8.514  < 2e-16 ***
## Lunghezza        10.2833     0.3009  34.177  < 2e-16 ***
## Cranio           10.5063     0.4263  24.648  < 2e-16 ***
## Tipo.partoNat    30.1601    12.1027   2.492  0.01277 *  
## SessoM           77.9171    11.1994   6.957 4.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2491 degrees of freedom
## Multiple R-squared:  0.7277, Adjusted R-squared:  0.727 
## F-statistic:  1109 on 6 and 2491 DF,  p-value: < 2.2e-16

Here we could already be somewhat happy with the results obtained, the adjusted r squared is still high 0.73, and very close to the one obtained from the model with all the regressors. But since we look for semplicity and a parsimonious model, I’ll try in the next steps at removing other variables and see what happens.

# remove Tipo.parto and n.gravidanze

mod4 = update(mod3,~.-Tipo.parto-N.gravidanze)
summary(mod4)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.1  -184.4   -17.4   163.3  2626.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.6732   135.5952 -49.055  < 2e-16 ***
## Gestazione     31.3262     3.7884   8.269  < 2e-16 ***
## Lunghezza      10.2024     0.3009  33.909  < 2e-16 ***
## Cranio         10.6706     0.4247  25.126  < 2e-16 ***
## SessoM         79.1027    11.2205   7.050 2.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1652 on 4 and 2493 DF,  p-value: < 2.2e-16

Removing Tipo.parto and n.gravidanze almost doesn’t affect the results (R-squared still 0.73). So I’ll go on excluding them from the model.

Now I would like to check whether we have multicollinearity in the remaining variables.

vif_values <- car::vif(mod4)

# Convert VIF results to a data frame (fixing the structure)
vif_table <- data.frame(
  Variable = rownames(as.data.frame(vif_values)),  
  VIF = round(as.numeric(vif_values), 2)         
)

# Print formatted table
kable(vif_table, 
      caption = "Variance Inflation Factor (VIF) for Model Variables",
      col.names = c("Variable", "VIF")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Variance Inflation Factor (VIF) for Model Variables
Variable	VIF
Gestazione	1.65
Lunghezza	2.07
Cranio	1.61
Sesso	1.04

We can see a moderate correlation among these values, but the vif values are still far under 5. Let’s try anyway to remove one of these variables and see how the model changes. We could try removing Cranio variable.

mod5 = update(mod4,~.-Cranio)
summary(mod5)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Sesso, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.2  -193.1   -22.9   186.8  3618.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5223.6509   137.7911 -37.910  < 2e-16 ***
## Gestazione     44.4964     4.1994  10.596  < 2e-16 ***
## Lunghezza      13.6010     0.3008  45.215  < 2e-16 ***
## SessoM         90.4500    12.5484   7.208 7.49e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 307.9 on 2494 degrees of freedom
## Multiple R-squared:  0.6568, Adjusted R-squared:  0.6563 
## F-statistic:  1591 on 3 and 2494 DF,  p-value: < 2.2e-16

Here the R-squared drops significantly, therefore we should keep this feature in the model

Let’s try to see if we have interactions among some variables (i can’t think at any possible interaction that makes sense a part from Sesso combined with other features, let’s try with gestazione)

# i start from model 4 since it seemed the best so far
mod6 = update(mod4,~.+Sesso:Gestazione)
summary(mod6)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     Gestazione:Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1133.40  -183.31   -17.02   162.34  2625.26 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6540.5979   166.9409 -39.179  < 2e-16 ***
## Gestazione           28.3984     4.5760   6.206 6.35e-10 ***
## Lunghezza            10.2066     0.3009  33.922  < 2e-16 ***
## Cranio               10.6715     0.4247  25.130  < 2e-16 ***
## SessoM             -188.8609   235.2241  -0.803    0.422    
## Gestazione:SessoM     6.8666     6.0208   1.140    0.254    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2492 degrees of freedom
## Multiple R-squared:  0.7263, Adjusted R-squared:  0.7257 
## F-statistic:  1322 on 5 and 2492 DF,  p-value: < 2.2e-16

This clearly makes the model worse than before.

Let’s check if we have non linear effects: Gestazione seems to have a quadratic effect (especially at low months).

mod7 = update(mod4,~.+I(Gestazione^2))
summary(mod7)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     I(Gestazione^2), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1132.76  -181.60   -15.82   162.79  2649.15 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4640.6017   899.9572  -5.156 2.71e-07 ***
## Gestazione        -80.9464    49.8136  -1.625   0.1043    
## Lunghezza          10.3056     0.3041  33.892  < 2e-16 ***
## Cranio             10.7671     0.4265  25.247  < 2e-16 ***
## SessoM             76.9130    11.2530   6.835 1.03e-11 ***
## I(Gestazione^2)     1.4988     0.6631   2.260   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.9 on 2492 degrees of freedom
## Multiple R-squared:  0.7267, Adjusted R-squared:  0.7261 
## F-statistic:  1325 on 5 and 2492 DF,  p-value: < 2.2e-16

mod8 = update(mod7,~.-Gestazione)
summary(mod8)

## 
## Call:
## lm(formula = Peso ~ Lunghezza + Cranio + Sesso + I(Gestazione^2), 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1135.54  -183.52   -17.07   164.34  2630.57 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6089.1448   123.7225 -49.216  < 2e-16 ***
## Lunghezza          10.2139     0.2989  34.173  < 2e-16 ***
## Cranio             10.6907     0.4240  25.213  < 2e-16 ***
## SessoM             78.4604    11.2164   6.995 3.39e-12 ***
## I(Gestazione^2)     0.4244     0.0504   8.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2493 degrees of freedom
## Multiple R-squared:  0.7264, Adjusted R-squared:  0.7259 
## F-statistic:  1655 on 4 and 2493 DF,  p-value: < 2.2e-16

The last model considering quadratic Gestazione has a slighly better adjusted R-squared than the same model with linear Gestazione (mod4).

Now let’s look at the AIC and BIC values of all the models created (I expect model 4 and 8 to be the best).

# Compute AIC and convert it into a data frame
aic_values <- AIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8)
aic_table <- data.frame(Model = rownames(aic_values), 
                        DF = aic_values$df, 
                        AIC = round(aic_values$AIC, 2))

# Print formatted AIC table
kable(aic_table, 
      caption = "Akaike Information Criterion (AIC) for Model Comparison",
      col.names = c("Model", "Degrees of Freedom", "AIC")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Akaike Information Criterion (AIC) for Model Comparison
Model	Degrees of Freedom	AIC
mod1	12	35145.57
mod2	9	35149.33
mod3	8	35148.67
mod4	6	35159.12
mod5	5	35721.00
mod6	7	35159.82
mod7	7	35156.01
mod8	6	35156.65

# Compute BIC and convert it into a data frame
bic_values <- BIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8)
bic_table <- data.frame(Model = rownames(bic_values), 
                        DF = bic_values$df, 
                        BIC = round(bic_values$BIC, 2))

# Print formatted BIC table
kable(bic_table, 
      caption = "Bayesian Information Criterion (BIC) for Model Comparison",
      col.names = c("Model", "Degrees of Freedom", "BIC")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Bayesian Information Criterion (BIC) for Model Comparison
Model	Degrees of Freedom	BIC
mod1	12	35215.45
mod2	9	35201.73
mod3	8	35195.25
mod4	6	35194.06
mod5	5	35750.11
mod6	7	35200.58
mod7	7	35196.77
mod8	6	35191.59

AIC tends to prefer a bit more complex models, hence it give the lowest score to the first model. But if look at BIC it is clear that the best is the last one (one interesting things to notice is how worse the model 5 is, so we cannot remove Cranio although is highly correlated with Lunghezza).

Let’s try to check what we get if we use stepAIC function.

n = nrow(data)


stepwise.mod = MASS::stepAIC(mod1,direction="both",k=log(n))

## Start:  AIC=28118.61
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1     36710 186779904 28111
## - Fumatrici     1     90677 186833870 28112
## - Ospedale      2    687555 187430749 28112
## - N.gravidanze  1    446244 187189438 28117
## - Tipo.parto    1    451073 187194266 28117
## <none>                      186743194 28119
## - Sesso         1   3610705 190353899 28159
## - Gestazione    1   5458852 192202046 28183
## - Cranio        1  45318506 232061700 28654
## - Lunghezza     1  87861708 274604902 29074
## 
## Step:  AIC=28111.28
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Fumatrici     1     91599 186871503 28105
## - Ospedale      2    693914 187473818 28105
## - Tipo.parto    1    452049 187231953 28110
## <none>                      186779904 28111
## - N.gravidanze  1    631082 187410986 28112
## + Anni.madre    1     36710 186743194 28119
## - Sesso         1   3617809 190397713 28151
## - Gestazione    1   5424800 192204704 28175
## - Cranio        1  45569477 232349381 28649
## - Lunghezza     1  87852027 274631931 29066
## 
## Step:  AIC=28104.68
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Ospedale      2    702925 187574428 28098
## - Tipo.parto    1    444404 187315907 28103
## <none>                      186871503 28105
## - N.gravidanze  1    608136 187479640 28105
## + Fumatrici     1     91599 186779904 28111
## + Anni.madre    1     37633 186833870 28112
## - Sesso         1   3601860 190473363 28145
## - Gestazione    1   5358199 192229702 28168
## - Cranio        1  45613331 232484834 28642
## - Lunghezza     1  88259386 275130889 29063
## 
## Step:  AIC=28098.41
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Tipo.parto    1    467626 188042054 28097
## <none>                      187574428 28098
## - N.gravidanze  1    648873 188223301 28099
## + Ospedale      2    702925 186871503 28105
## + Fumatrici     1    100610 187473818 28105
## + Anni.madre    1     44184 187530244 28106
## - Sesso         1   3644818 191219246 28139
## - Gestazione    1   5457887 193032315 28162
## - Cranio        1  45747094 233321522 28636
## - Lunghezza     1  87955701 275530129 29051
## 
## Step:  AIC=28096.81
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      188042054 28097
## - N.gravidanze  1    621053 188663107 28097
## + Tipo.parto    1    467626 187574428 28098
## + Ospedale      2    726146 187315907 28103
## + Fumatrici     1     92548 187949505 28103
## + Anni.madre    1     45366 187996688 28104
## - Sesso         1   3650790 191692844 28137
## - Gestazione    1   5477493 193519547 28161
## - Cranio        1  46098547 234140601 28637
## - Lunghezza     1  87532691 275574744 29044

summary(stepwise.mod)

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1149.37  -180.98   -15.57   163.69  2639.09 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6681.7251   135.8036 -49.201  < 2e-16 ***
## N.gravidanze    12.4554     4.3416   2.869  0.00415 ** 
## Gestazione      32.3827     3.8008   8.520  < 2e-16 ***
## Lunghezza       10.2455     0.3008  34.059  < 2e-16 ***
## Cranio          10.5410     0.4265  24.717  < 2e-16 ***
## SessoM          77.9807    11.2111   6.956 4.47e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7265 
## F-statistic:  1327 on 5 and 2492 DF,  p-value: < 2.2e-16

The automatic selection of the model includes also the N.gravidanze feature, which I had removed in my process of selection of the model. Let’s try modifying model with quadratic Gestazione adding N.gravidanze and see if we could improve BIC value.

mod9 = update(mod8,~.+N.gravidanze)
summary(mod9)

## 
## Call:
## lm(formula = Peso ~ Lunghezza + Cranio + Sesso + I(Gestazione^2) + 
##     N.gravidanze, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1146.79  -181.16   -14.75   165.39  2643.24 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.100e+03  1.236e+02 -49.355  < 2e-16 ***
## Lunghezza        1.026e+01  2.988e-01  34.326  < 2e-16 ***
## Cranio           1.056e+01  4.258e-01  24.805  < 2e-16 ***
## SessoM           7.731e+01  1.121e+01   6.898 6.64e-12 ***
## I(Gestazione^2)  4.386e-01  5.057e-02   8.674  < 2e-16 ***
## N.gravidanze     1.253e+01  4.340e+00   2.889   0.0039 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.6 on 2492 degrees of freedom
## Multiple R-squared:  0.7273, Adjusted R-squared:  0.7268 
## F-statistic:  1329 on 5 and 2492 DF,  p-value: < 2.2e-16

bic_values <- BIC(stepwise.mod, mod8, mod9)

# Convert BIC results to a data frame
bic_table <- data.frame(
  Model = rownames(bic_values), 
  DF = bic_values$df, 
  BIC = round(bic_values$BIC, 2)  # Round to 2 decimal places
)

# Print formatted BIC table
kable(bic_table, 
      caption = "Bayesian Information Criterion (BIC) for Model Comparison",
      col.names = c("Model", "Degrees of Freedom", "BIC")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Bayesian Information Criterion (BIC) for Model Comparison
Model	Degrees of Freedom	BIC
stepwise.mod	7	35193.65
mod8	6	35191.59
mod9	7	35191.06

The output given from the stepwise mod function is not as good as model 8 or 9. Model 9 has a slightly better BIC the model 8 (also an adjusted R-squared which is a bit better), hence I will conclude that this model 9 is the best among those tried.

Analysis obtained model

The final considered model is \[ \hat{Y}_{\text{peso}} = -6100 + 10.26 \times \text{Lunghezza} + 10.56 \times \text{Cranio} + 77.31 \times \text{SessoM} + 0.4386 \times (\text{Gestazione})^2 + 12.53 \times \text{N.gravidanze}. \] Looking at the \(R^2\) value (0.73) we can say that the model is able to “explain” 73% of the variability of the model, which is quite a good value in a real case.

Now it’s time to make an analysis on the residuals of the model.

par(mfrow=c(1,2))
plot(residuals(mod9))
abline(h=mean(residuals(mod9)),col=2)
plot(density(residuals(mod9)))

bptest(mod9) # no constant variance

## 
##  studentized Breusch-Pagan test
## 
## data:  mod9
## BP = 90.909, df = 5, p-value < 2.2e-16

dwtest(mod9) # no autocorrelation

## 
##  Durbin-Watson test
## 
## data:  mod9
## DW = 1.9526, p-value = 0.1178
## alternative hypothesis: true autocorrelation is greater than 0

shapiro.test(residuals(mod9)) # here we refuse normality

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod9)
## W = 0.97413, p-value < 2.2e-16

The residuals seems to be evenly spread along the zero line, so this is a good sign. Nevertheless some values seem to be far away from the 0 value, in particular there is a record with a huge error value, I will inspect it later.

Breusch-Pagan test gives back a p-value close to 0, hence we must refuse homoskedasticity (probabily this is due to some outliers). Durbin-Watson instead gives a p-value of 0.12, which indicates we cannot refuse the hypotesis of independece among the residuals. There is no autocorrelation among the residuals. Shapiro test has a p-value close to zero indicating that the residuals don’t follow a normal distribution, but we could expect this fact considering that the target variable PESO is not normal either.

Now it’s time to inspec leverages and outliers:

## leverage
lev = hatvalues(mod9)
plot(lev)
p=sum(lev)
#threshold which indicates leverage values
threshold = 2*p/n
abline(h=threshold,col=2)

# outliers
plot(rstudent(mod9))
abline(h=c(-2,2),col=2)

cook = cooks.distance(mod9)
plot(cook)

max_cook <- round(max(cook), 3)
cat("Maximum Cook's Distance:", max_cook, "\n")

## Maximum Cook's Distance: 0.825

From the first plot we can notice many leverage values, there are also quite a lot of outliers in the residuals, but the major inconvenient comes out looking at cook distance plot, there is a value equal to 0.825.

Let’s check in detail this record:

number_influent_record <- which(unname(cook) > 0.5)
influent_record = data[number_influent_record,]
# Print formatted influential record table
kable(influent_record, 
      caption = "Details of Influential Records (Cook's Distance > 0.5)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Details of Influential Records (Cook’s Distance > 0.5)
	Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
1551	35	1	0	38	4370	315	374	Nat	osp3	F

Looking at this record we can clearly notice that there is an issue with the Lunghezza field. We have quite an heavy baby, but a length of the body which doesn’t match. I don’t have enough information to understand whether this is just an error while recording the data, or if it is due to some strange illness (maybe some form of Nanism). What we know is that for sure it is influent and has a great impact on our model.

Overall the metrics associated with the residuals don’t provide very good results. My opinion is that the model in most cases can capture quite well the weight of a baby, but there are some particular cases (especially the one analyzed before) in which the predicted weight is different from the real one, and this could be due to other factors which are not in the dataset of this project. There could be strange pathologies that can affect a lot PESO feature and hence provide a huge variability in the results, this can be also the reason why the residuals don’t have constant variance.

3. Prediction and Results

Let’s now make some prediction using the model described in the previous section. Let’s estimate the weight of a baby girl born from a mother at the third pregnancy at the 39th week.

My model consider also LUNGHEZZA and CRANIO as regressors, hence for these values I will consider the mean.

mean_lunghezza <- mean(data$Lunghezza, na.rm = TRUE)
mean_cranio <- mean(data$Cranio, na.rm = TRUE)
gender = 'F'
n_pregnancy = 3
weeks = 39

# Create a data frame with the required predictors
new_data <- data.frame(
  Lunghezza = mean_lunghezza,
  Cranio = mean_cranio,
  Sesso = factor(gender), 
  Gestazione = weeks,
  N.gravidanze = n_pregnancy
)


predicted_value <- predict(mod9, newdata = new_data)
# Convert to data frame and round the value
pred_table = data.frame(Predicted_Weight = round(predicted_value, 2))
kable(pred_table, 
      caption = "Predicted Birth Weight Based on Model 9", 
      col.names = c("Predicted Weight (grams)")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Predicted Birth Weight Based on Model 9
Predicted Weight (grams)
3270.18

4. Visualizations

As first visualization i would like to have an idea of the difference between observed values and those estimated by the model

data$predicted <- predict(mod9)
data$residuals <- residuals(mod9)

ggplot(data, aes(x = predicted, y = Peso)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Predicted Weight", y = "Observed Weight",
       title = "Comparison of Observed and Predicted Weight") +
  theme_minimal()

From the plot we can notice how the model perform quite well for weight around the medium - high weights, but tends to underestimate the weight when the observed value is low. We can say that in case of low weights the model is not reliable.

Now, for example, we can check how the Gestazione feature affect the weight.

min(Gestazione)

## [1] 25

max(Gestazione)

## [1] 43

# Restrict gestation weeks from 25 to 43 in steps of 1
eff_gest <- ggpredict(mod9, terms = "Gestazione [25:43 by=1]")

ggplot(eff_gest, aes(x = x, y = predicted)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), 
              fill = "blue", alpha = 0.2) +
  labs(x = "Gestation Weeks",
       y = "Predicted Birth Weight (grams)",
       title = "Effect of Gestation Weeks on Predicted Birth Weight") +
  theme_minimal()

As we would expect the weight increases wit the number of the gestation weeks, and we can also notice that we have a greater variability in low and high values, hence (as we could see also in the previous plot) the “extreme” values are very difficult to estimate. This is due to low and high values which are less represented in the dataset and also because in these areas the trend doesn’t appear to be linear.

Now we can see again the boxplot representing how the mother smoking feature affects the birth weight.

ggplot(data, aes(x = factor(Fumatrici), y = Peso, fill = factor(Fumatrici))) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("skyblue", "tomato"),
                    labels = c("Non-smoker", "Smoker")) +
  labs(x = "Mother Smoking Status", 
       y = "Birth Weight (grams)",
       title = "Birth Weight Distribution by Mother Smoking Status") +
  theme_minimal()

We can see that it seems that a smoker mother will give birth to slightly lighter baby, but as we have already seen in the previous sections this feature is not statistically relevant.

Conclusion

To conclude, I can say that given the dataset and information provided we are not able to build a 100% satisfactory regression model. The model itself is quite good in the prediction, but has some difficulties with extreme values. These extreme values, not well explained by the model, causes also many problems in the residuals analysis, giving many leverages values, outliers and causing heteroskedasticity. We could try using more complex models then the simple linear regression, or maybe the best solution would be to restrict the use of this model within certain ranges.