A Statistical Model to Predict Newborn Weight

Company context

Company: Neonatal Health Solutions

Objective: Develop a statistical model capable of accurately predicting newborn birth weight based on clinical variables collected from three hospitals. The project aims to improve the management of high-risk pregnancies, optimize hospital resource allocation, and ensure better neonatal health outcomes.

This initiative is positioned within a broader context of increasing focus on the prevention of neonatal complications. The ability to predict birth weight represents a crucial opportunity to enhance clinical planning and mitigate risks associated with problematic deliveries, such as premature births or low-birth-weight infants. Below are the key benefits this project will bring to the company and the healthcare sector:

Enhancement of Clinical Forecasting:
Birth weight is a key indicator of neonatal health. Having an accurate predictive model enables medical staff to intervene promptly in case of anomalies, thereby reducing perinatal complications such as respiratory distress or hypoglycemia.

Optimization of Hospital Resources:
Knowing in advance which newborns may require intensive care allows hospitals to efficiently organize human and technological resources. This leads to lower operational costs and improved planning for the use of neonatal intensive care units (NICUs).

Prevention and Identification of Risk Factors:
The model will highlight the factors that most significantly and negatively affect birth weight (such as maternal smoking, multiple pregnancies, or advanced maternal age). This information is invaluable for preventive measures and personalized pregnancy management, enabling proactive interventions in high-risk cases.

Evaluation of Hospital Practices:
Through comparative analysis across the three participating hospitals, the company will be able to identify potential differences in clinical outcomes, such as higher rates of cesarean deliveries in specific facilities. This allows for quality monitoring and the harmonization of clinical protocols across hospitals, leading to more consistent standards of care.

Support for Strategic Planning:
Data analysis and predictive insights can inform decision-making at both clinical and strategic levels. The company can leverage this information to implement new public health policies, ensuring a positive impact on neonatal morbidity and mortality rates.

R Code

This section contains all the code and the functions used to conduct the analysis and implement the predictive model

library('ggplot2')
library('gghalves')
library('gridExtra')
library('dplyr')
library('moments')
library('car')
library('MASS')
library('lmtest')

data_set = read.csv("neonati.csv",
                    stringsAsFactors = T)

# number of data_set rows
n = nrow(data_set)

# quantitative continuos variable analysis
box_plot_violin = function(data_set, variable, color, title, axis_description) {
  ggplot(data=data_set,
       aes(x = "", y = .data[[variable]]))+
  geom_half_boxplot(side = 'l',
                    fill = color,
                    alpha = 0.7,) +
  
  geom_half_violin(side = 'r',
                   fill = color,
                   alpha = 0.3,) +
  labs(
      title = title,
      x = "",
      y = axis_description
    ) +
  theme_minimal()
}

# generate a box plot with a density distribution for the following quantitative continuos variables:
# Anni.madre
# Gestazione
# Cranio
# Lunghezza
# Peso

distr_anni_madre_plot = box_plot_violin(data_set = data_set,
                                   variable = "Anni.madre",
                                   color = "lightsalmon",
                                   title = "Anni.madre",
                                   axis_description = "Age")

distr_gestazione_plot = box_plot_violin(data_set = data_set,
                                   variable = "Gestazione",
                                   color = "yellow",
                                   title = "Gestazione",
                                   axis_description = "weeks")

distr_craio_plot = box_plot_violin(data_set = data_set,
                                   variable = "Cranio",
                                   color = "violet",
                                   title = "Cranio",
                                   axis_description = "mm")

distr_lunghezza_plot = box_plot_violin(data_set = data_set,
                                   variable = "Lunghezza",
                                   color = "lightgreen",
                                   title = "Lunghezza",
                                   axis_description = "mm")

distr_peso_plot = box_plot_violin(data_set = data_set,
                                   variable = "Peso",
                                   color = "lightblue",
                                   title = "Peso",
                                   axis_description = "g")

# generate a bar plot for the following quantitative discrete variables:
# N. Gravidanze
bar_plot_graph = function(data_set, variable, color, x_axis_description, y_axis_description){
  ggplot(data = data_set, aes(x = factor(.data[[variable]]))) +
    geom_bar(aes(y = after_stat(count / sum(count))), 
             fill = color, color = color, width = 0.7) +
    scale_y_continuous(labels = scales::percent_format()) +
    labs(
      title = variable,
      x = x_axis_description,
      y = y_axis_description
    ) +
    theme_minimal()
}

distr_n_gravidanze_plot = bar_plot_graph(data_set = data_set,
                               variable = "N.gravidanze",
                               color = "steelblue",
                               x_axis_description = "Number of previous pregnancies",
                               y_axis_description = "Percentage of mothers")

# generate a stacked bar plot for the following qualitative variable
# Fumatrici
# Tipo.parto
# Ospedale
# Sesso
stacked_bar_plot = function(data_set, variable, vcol, vlab, y_axis_description){
  ggplot(data = data_set,
       aes(x="",
           fill = factor(.data[[variable]])))+
    geom_bar(position = "fill",
           stat = "count",
           alpha = 0.8) +
    scale_y_continuous(labels = scales::percent_format()) +
    scale_fill_manual(
    values = vcol,
    labels = vlab
  ) +
  labs(
    x = variable,
    y = y_axis_description
  ) +
  theme_minimal()+
  theme(
    legend.position = "bottom",
    legend.title = element_blank()
  )
}


fumatrici_plot = stacked_bar_plot(
  data_set = data_set,
  variable = "Fumatrici",
  vcol = c("green", "red"),
  vlab = c("No", "Yes"),
  y_axis_description = "%"
)

tipo_parto_plot = stacked_bar_plot(
  data_set = data_set,
  variable = "Tipo.parto",
  vcol = c("red", "green"),
  vlab = c("Cesarean", "Natural"),
  y_axis_description = "%"
)

ospedale_plot = stacked_bar_plot(
  data_set = data_set,
  variable = "Ospedale",
  vcol = c("blue", "yellow","violet"),
  vlab = c("osp1", "osp2", "osp3"),
  y_axis_description = "%"
)

sesso_plot = stacked_bar_plot(
  data_set = data_set,
  variable = "Sesso",
  vcol = c("lightblue", "salmon"),
  vlab = c("M", "F"),
  y_axis_description = "%"
)

# this plot highlights the diffrences betweeen the delivery type through the three
# hospital
delivey_comparison_hospital = ggplot(data_set,
                                     aes(x = Ospedale, fill = Tipo.parto)) +
  geom_bar(position = "fill", color = "white") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(
    values = c("Nat" = "green", "Ces" = "red"),
    labels = c("Nat" = "Natural", "Ces" = "Cesarean")
  ) +
  labs(
    title = "Proportion of Delivery Type by Hospital",
    x = "Hospital",
    y = "Percentage",
    fill = "Delivery Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom"
  )

# This dataframe store the population mean of Peso and Lunghezza 
# Reference link: https://www.ospedalebambinogesu.it/da-0-a-30-giorni-come-si-presenta-e-come-cresce-80012/
get_pop_peso_lunghezza =  function(){
  Peso_mu_0 = 3300 #[g]
  Lunghezza_mu_0 = 500 #[mm]
  
  return(
    data.frame(
      Peso_mu_0,
      Lunghezza_mu_0
    )
  )
}

pop_peso_lunghezza= get_pop_peso_lunghezza()

# function to create a box plot by a specified category for a specified variable
# the fuction will be used to create box plot for the following variables
# - Peso
# - Lunghezza
# - Cranio
# Using the category Sesso

box_plot_variable_by_cat = function(data_set, cat, variable, color, title, ylab){
  ggplot(data = data_set)+
  geom_boxplot(aes(x = .data[[cat]], y= .data[[variable]],
                   fill = as.factor(.data[[cat]])),
               fill = color,
               alpha = 0.7)+
  
  labs(
    title = title,
    x = cat,
    y = ylab
  )+
  theme_minimal()+
  theme(
    legend.title=element_blank(),
    legend.position="bottom"
    )
}

# Peso comparison per Sesso
peso_by_sesso_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "Sesso",
  variable = "Peso",
  color = "lightblue",
  title = "Peso per Sesso",
  ylab = "Peso [g]"
)

# Peso comparison per Ospedale
peso_by_ospedale_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "Ospedale",
  variable = "Peso",
  color = "lightblue",
  title = "Peso per Ospedale",
  ylab = "Peso [g]"
)

# Peso comparison per Tipo.parto
peso_by_tipo_parto_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "Tipo.parto",
  variable = "Peso",
  color = "lightblue",
  title = "Peso per Tipo.parto",
  ylab = "Peso [g]"
)

# Lunghezza comparison per Sesso
lunghezza_by_sesso_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "Sesso",
  variable = "Lunghezza",
  color = "lightgreen",
  title = "Lunghezza per Sesso",
  ylab = "Lunghezza [mm]"
)

# Lunghezza comparison per Sesso
cranio_by_sesso_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "Sesso",
  variable = "Cranio",
  color = "violet",
  title = "Cranio per Sesso",
  ylab = "Cranio [mm]"
)

# function to calculated variability and kutosis indexes of Peso
skew_kurt = function(x){
  return(cbind(Fisher = skewness(x), Kutosis = kurtosis(x) - 3))
}

# create a function that plot the correlation matrix
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = 1.0)
}

correlation_matrix_plot = function(x){
  return(pairs(x,
               upper.panel = panel.smooth,
               lower.panel = panel.cor))
}

# this function create a table to show BIC for the specified model
models_comparison = function(model_1, model_2, perform_anova_test = T) {
  # Caluculate ANOVA F-test
  if (perform_anova_test == T){
    anova_res = anova(model_2, model_1)
    cat("ANOVA F-test between the models:\n")
    print(anova_res)
    cat("\n")
  }
  
  # Calculate BIC and AIC values
  bic_values = c(BIC(model_1), BIC(model_2))
  aic_values = c(AIC(model_1), AIC(model_2))
  bic_results <- data.frame(
    Model = c("Previous Model", "New Model"),
    BIC = bic_values,
    AIC = aic_values
  )
  
  cat("Models BIC and AIC:\n")
  print(bic_results)
  
  invisible(NULL)
}

# plot leverage points of a model
leverage_plot = function(model, number_of_rows){
  lev = hatvalues(mod_6)
  #plot(lev)
  p = sum(lev)
  # calculate the threshold
  threshold = 2*p/number_of_rows
  #abline(h=threshold, col=2)
  
  ggplot(data = NULL,
         aes(x = 1:number_of_rows, y = lev)) +
    geom_point(shape = 1,
               colour = "black") +
    geom_hline(yintercept = threshold,
               col = "red") +
    scale_shape_manual(values = 21:25) +
    scale_color_manual(values = c("black", "red")) +
    labs(x = "Index",
         y = "Leverage") +
    labs(
        title = "Leverage points")+
    theme_minimal()
}

# outliers model analysis
outliers_analysis = function(model){
  plot(rstudent(model),
       main = "Studentized residuals",
       ylab = "rstudent",
       xlab = "Index")
  # thresholds
  abline(h=c(-2,2), col= 2)
  
  # t-test
  print(outlierTest(model))
  
  # plots cook's distances
  c.dist=cooks.distance(model)
  plot(c.dist,
       main = "Cook's distance",
       ylab = "Cook's D",
       xlab = "Index")
  print('max cook distance')
  print(max(c.dist))
  
  # Residual density
  plot(density(residuals(model)),
       main = "Residual density",
       xlab = "Residuals")
}

# plot model graphical evaluations
model_graphs_plot = function(data_set, var, resp, cat, title){
  ggplot(data=data_set)+
  geom_point(aes(
    x= .data[[var]],
    y= .data[[resp]],
    col = .data[[cat]]
  ),
  alpha = 0.3,
  position = "jitter")+
  geom_smooth(
    aes(x = Gestazione, y = .data[[resp]], group = .data[[cat]]),
    colour = "white",
    linewidth = 1.5,
    se = FALSE,
    method = "lm"
  ) +
  geom_smooth(aes(
    x= .data[[var]],
    y= .data[[resp]],
    col = .data[[cat]]
  ),
  se = F,
  method = 'lm')+
  labs(
    title = title
  )+
  theme_minimal()+
  theme(
    legend.position = "bottom",
    legend.title = element_blank()
  )
}

Project Details

1. Data collections and Dataset structure

To build the predictive model, data were collected on 2,500 newborns from three hospitals. The dataset includes the following variables:

Età della madre (Mother’s age): measured in years.
Numero di gravidanze (Number of pregnancies): total number of pregnancies the mother has had.
Fumo materno (Maternal smoking): binary indicator (0 = non-smoker, 1 = smoker)
Durata della gravidanza (Gestational duration): number of weeks of gestation.
Peso del neonato (Birth weight): neonatal weight at birth, measured in grams.
Lunghezza e diametro del cranio (Length and head circumference): neonatal body length and cranial diameter, which can also be measured during pregnancy via ultrasound.
Tipo di parto (Type of delivery): natural (vaginal) or cesarean delivery.
Ospedale di nascita (Birth hospital): Hospital 1, 2, or 3.
Sesso del neonato (Newborn’s sex): male (M) or female (F).

The primary objective is to identify which of these variables are the most predictive of birth weight, with a specific focus on the impact of maternal smoking and gestational duration, as these factors may indicate a higher likelihood of premature birth.

2.1 Analysis and Modelling

The summary of the statististics of the dataset vaiables show the following information:

summary(data_set)

##    Anni.madre     N.gravidanze       Fumatrici        Gestazione   
##  Min.   : 0.00   Min.   : 0.0000   Min.   :0.0000   Min.   :25.00  
##  1st Qu.:25.00   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:38.00  
##  Median :28.00   Median : 1.0000   Median :0.0000   Median :39.00  
##  Mean   :28.16   Mean   : 0.9812   Mean   :0.0416   Mean   :38.98  
##  3rd Qu.:32.00   3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.:40.00  
##  Max.   :46.00   Max.   :12.0000   Max.   :1.0000   Max.   :43.00  
##       Peso        Lunghezza         Cranio    Tipo.parto Ospedale   Sesso   
##  Min.   : 830   Min.   :310.0   Min.   :235   Ces: 728   osp1:816   F:1256  
##  1st Qu.:2990   1st Qu.:480.0   1st Qu.:330   Nat:1772   osp2:849   M:1244  
##  Median :3300   Median :500.0   Median :340              osp3:835           
##  Mean   :3284   Mean   :494.7   Mean   :340                                 
##  3rd Qu.:3620   3rd Qu.:510.0   3rd Qu.:350                                 
##  Max.   :4930   Max.   :565.0   Max.   :390

To perform an optimal description the variable analysis will be separated by type as follow.

Quantitative continuous variables - Preliminary analysis

grid.arrange(distr_anni_madre_plot,
             distr_gestazione_plot,
             distr_craio_plot,
             distr_lunghezza_plot,
             distr_peso_plot,
             nrow = 2,
             ncol = 3)

Variable	Remarks
Anni.madre	Quantitative discrete variable in interval scale treated as continuos. The observed outliers close to zero are likely attributable to data-entry errors and should be verified. Such values may act as high-leverage points, potentially affecting model stability.
Gestazione	Quantitative discrete variable in ratio scale treated as continuos. It is evident that most births occur during the normal gestational period, that is, after the 37th week of pregnancy.
Cranio	Quantitative continuous variable in ratio scale. The outliers observed in the lower tail of the distribution may indicate preterm births, as smaller head circumference values are often associated with reduced gestational duration.
Lunghezza	Quantitative continuous variable in ratio scale. The same considerations of the variable Cranio are valid.
Peso	Quantitative continuous variable in ratio scale. The same consideration of the variables Cranio and Lunghezza are valid. In add, a visual inspection of its distribution suggests that Peso approximates a normal (Gaussian) distribution, with a symmetric shape centered around the mean. However, this assumption will need to be formally verified through appropriate normality tests in the subsequent analysis.

Regarding

Quantitative discrete variables - Preliminary analysis

distr_n_gravidanze_plot

Variable	Remarks
N.gravidanze	Quantitative discrete variable in ratio scale The average value close to 1 meaning they have had no or only one previous pregnancy.

N.gravidanze

Quantitative discrete variable in ratio scale

The average value close to 1 meaning they have had no or only one previous pregnancy.

Qualitative variables - Preliminary analysis

grid.arrange(fumatrici_plot,
             tipo_parto_plot,
             ospedale_plot,
             sesso_plot,
             nrow = 2,
             ncol = 2)

Variable	Remarks
Fumatrici	Qualitative binary variable in nominal scale. 0 (No) = non-smoker, 1 (Yes) = smoker The majority of mothers no smoke during the pregnancy.
Tipo.parto	Qualitative variable on nominal scale. The majority of births were natural deliveries, whereas cesarean sections represented a smaller fraction of the sample.
Ospedale	Qualitative variable on nominal scale. The data are distributed among three hospitals (osp1, osp2, osp3). This balanced distribution indicates that the sample is well diversified across the participating institutions, reducing potential bias associated with single-hospital data collection.
Sesso	Qualitative variable on nominal scale. The data show an almost equal distribution between male and female infants, suggesting a well-balanced sample with respect to gender. This balance sudgest that subsequent analyses are not affected by sex-related sampling bias.

Data Cleaning

Following the exploratory descriptive analysis, a data cleaning procedure is carried out to correct data inconsistencies, handle outliers, and recode variables with non-normal or highly skewed distributions following this points:

Remove from the dataset the observations where Anni.madre equals 0 or 1, as identified by the graphical analysis.

data_set = data_set %>%
  filter(
    Anni.madre > 1
  )
# update the number of rows
n = nrow(data_set)

Convert Fumatrici into a factor variable.

data_set$Fumatrici <- as.factor(data_set$Fumatrici)

Create a categorical classification of N.gravidanze. This variable exhibits a markedly skewed and non-normal distribution. Consequently, it is recoded into three categorical classes:
- [0, 1): no previous pregnancies.
- [1, 2): 1 previous pregnancies.
- [2+): 2 or more previous pregnancies.

adding a new column to the dataset: “N.gravidanze_class” :

N.gravidanze_class = cut(data_set$N.gravidanze,
                         breaks = c(0, 1, 2, Inf),
                         include.lowest = T,
                         right = F,
                         labels = c("[0,1)", "[1,2)", "[2+)"))

data_set['N.gravidanze_class'] = N.gravidanze_class

N.gravidanze_class_plot = bar_plot_graph(data_set = data_set,
                               variable = "N.gravidanze_class",
                               color = "steelblue",
                               x_axis_description = "Number of previous pregnancies",
                               y_axis_description = "Percentage of mothers")
N.gravidanze_class_plot

2.1 Cesarean sections are being performed more often in some hospitals

The following summarization shows the percentage fraction of delivery type (Ces and Nat) among three hospitals.

round(prop.table(table(data_set$Ospedale, data_set$Tipo.parto), 1) * 100,2)

##       
##          Ces   Nat
##   osp1 29.66 70.34
##   osp2 29.95 70.05
##   osp3 27.82 72.18

delivey_comparison_hospital

The differences among three hospitals are very slight, at around 2%, which suggests that they all follow the same medical standards regarding pregnancy.

To asses this a Chi-square test with a significance level of 5% is performed.

chisq.test(table(data_set$Ospedale, data_set$Tipo.parto))

## 
##  Pearson's Chi-squared test
## 
## data:  table(data_set$Ospedale, data_set$Tipo.parto)
## X-squared = 1.083, df = 2, p-value = 0.5819

The test confirms the assumptions above. The \(H_0\) hypotesis of indipendence is not reject\((p-value > 0.05)\) indicating that the proportion of cesarean deliveries does not differ significantly among hospitals.

2.2 The mean weight and length of this sample of newborns are significantly equal to those of the population

The following output shows the population mean values of Peso and Lughezza (https://www.ospedalebambinogesu.it/da-0-a-30-giorni-come-si-presenta-e-come-cresce-80012/).

pop_peso_lunghezza

##   Peso_mu_0 Lunghezza_mu_0
## 1      3300            500

The following plots provide a detailed visualization of Peso and Lunghezza.

grid.arrange(distr_peso_plot,
             distr_lunghezza_plot,
             nrow = 1,
             ncol = 2)

Given the large sample size (n ≈ 2500), the Central Limit Theorem allows the assumption that the sampling distributions of Peso and Lunghezza are approximately normal.
This assumption is further supported by the results of the Shapiro–Wilk test, which did not indicate significant departures from normality, as well as by the graphical representations obtained in the descriptive analysis, where the density plots and box plots show approximately symmetric and bell-shaped distributions.

shapiro.test(data_set$Peso)

## 
##  Shapiro-Wilk normality test
## 
## data:  data_set$Peso
## W = 0.97068, p-value < 2.2e-16

shapiro.test(data_set$Lunghezza)

## 
##  Shapiro-Wilk normality test
## 
## data:  data_set$Lunghezza
## W = 0.90944, p-value < 2.2e-16

Infact, the null hypothesis \(H_0\) which assumes a normal distribution, is not rejected \(p-value < 0.05\).

By applying a two-sided t-test to each variable and setting the significance level at 5%, it is possible to assess whether each sample mean differs significantly from the corresponding population value.

t.test(data_set$Peso,
       mu = pop_peso_lunghezza$Peso_mu_0,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  data_set$Peso
## t = -1.505, df = 2497, p-value = 0.1324
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
##  3263.577 3304.791
## sample estimates:
## mean of x 
##  3284.184

The \(H_0\) hypotesis that say that the sample mean of Peso not differ from the Population mean is not reject\((p-value > 0.05)\) indicating that the difference is not statistically significantly .

t.test(data_set$Lunghezza,
       mu = pop_peso_lunghezza$Lunghezza_mu_0,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  data_set$Lunghezza
## t = -10.069, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
##  493.6628 495.7287
## sample estimates:
## mean of x 
##  494.6958

The \(H_0\) hypotesis that say that the sample mean of Lunghezza not differ from the Population mean is reject\((p-value < 0.05)\) indicating that the difference is statistically significantly.

These findings suggest that the sampled neonates are representative of the general population in terms of birth weight, but slightly smaller in terms of body length.

2.3 Anthropometric measurements differ significantly between the two sexes

The following plots provide a detailed visualization of the anthropometric variables (Peso, Lunghezza, and Cranio).

grid.arrange(distr_peso_plot,
             peso_by_sesso_box_plot,
             nrow = 1,
             ncol = 2)

grid.arrange(distr_lunghezza_plot,
             lunghezza_by_sesso_box_plot,
             nrow = 1,
             ncol = 2)

grid.arrange(distr_craio_plot,
             cranio_by_sesso_box_plot,
             nrow = 1,
             ncol = 2)

Descriptive analysis shows that male newborns generally exhibit higher anthropometric measurements than female newborns.
To verify whether these differences are statistically significant, a pairwise t-test is performed, after validating the assumptions of normality and homoscedasticity.

As per previouse analysis on Peso and Lunghezza, given the large sample size (n ≈ 2500), the Central Limit Theorem allows the assumption that the sampling distributions of Cranio is approximately normal.
This assumption is further supported by the results of the Shapiro–Wilk test, which did not indicate significant departures from normality, as well as by the graphical representations obtained in the descriptive analysis, where the density plots and boxplots show approximately symmetric and bell-shaped distributions.

shapiro.test(data_set$Cranio)

## 
##  Shapiro-Wilk normality test
## 
## data:  data_set$Cranio
## W = 0.96358, p-value < 2.2e-16

Infact, the null hypothesis \(H_0\) which assumes a normal distribution, is not rejected (p-value <= 0.05).

The assumption of homoscedasticity, that is, the equality of variances between groups, is verified using Levene’s test.
This procedure assesses whether the variability of the anthropometric measures differs significantly between sexes, a necessary condition for the appropriate application of the t-test.

leveneTest(Peso ~ Sesso, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    1  0.8222 0.3646
##       2496

leveneTest(Lunghezza ~ Sesso, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)   
## group    1  10.571 0.001164 **
##       2496                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveneTest(Cranio ~ Sesso, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    1  1.2098 0.2715
##       2496

The varaibles Peso and Cranio satisfy both the assumptions of normality and homoscedasticity allowing the use of a two sample t-tests with a fixed significance level \(α = 0.05\).

t.test(Peso ~ Sesso,
       data = data_set,
       var.equal = T,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.111, df = 2496, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.4963 -207.3721
## sample estimates:
## mean in group F mean in group M 
##        3161.061        3408.496

t.test(Cranio ~ Sesso,
       data = data_set,
       var.equal = T,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  Cranio by Sesso
## t = -7.4344, df = 2496, p-value = 1.436e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -6.110877 -3.560044
## sample estimates:
## mean in group F mean in group M 
##        337.6231        342.4586

For both variables the p-value is very small, very close to 0, therefore \(H_0\) hypothesis is rejected and it is concluded that the mean values differ significantly between male and female newborns for the variables under analysis.

Regarding the variable Lunghezza, it satisfy the normality assumption but not that of homoscedasticity. Consequently, it is not appropriate to apply the classic two sample t-test but a variant that relax the homoscedasticity assumption, the Welch’s t-test.

t.test(Lunghezza ~ Sesso,
       data = data_set,
       var.equal = F,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.939001  -7.882672
## sample estimates:
## mean in group F mean in group M 
##        489.7641        499.6750

The p-value is very small, very close to 0, therefore \(H_0\) hypothesis is rejected and it is concluded that the mean values differ significantly between male and female newborns for the variables under analysis.

2.2 Creation of the Regression Model

The output variable Peso, as verified in the previous section, approximately follow a normal distribution. In addition, the skewness and kurtosis indices are calculated:

skew_kurt(data_set$Peso)

##          Fisher  Kutosis
## [1,] -0.6474036 2.028753

The distribution appears slightly right-skewedand leptokurtic, as confirmed by the skewness and kurtosis coefficients.

This indicates that it has heavier tails and sharper peak than the normal distribution; however, it can still be reasonably approximated by a normal distribution.

Regressors analysis: Quantitative variables impact

The correlation of the input variables with Peso is analyzed.

correlation_matrix_plot(data_set)

From the analysis of the quantitative variables, it emerges that the birth weight (Peso) of a newborn is strongly correlated with:

Length (Lunghezza), with a correlation coefficient of 0.80.
Head circumference (Cranio), with a correlation coefficient of 0.70;

as shows in the scatter plots above.

Additionally, these two variables have a weak correlation 0.60 and should be not a problem for the linear moldel.

Gestational duration (Gestazione) shows a correlation coefficient of 0.59. Even if the relationship does not appears stricltly linear, it could be included in the model, as the pattern seems to follow a logarithmic trend, where birth weight (Peso) increases rapidly with gestational weeks before reaching a plateau.

Regressors analysis: Qualitative variables impact

The qualitative variables are treated differently, as it is difficult to extract meaningful information from them using a correlation matrix. It is only possible to observe that the data points are distributed across the various classes, without providing any clear quantitative relationship.

In the section “2.3 Anthropometric measurements differ significantly between the two sexes” has been demonstrated that Sesso exerts a statistically significant influence on Peso, leading to an expectedly relevant regression coefficient.
The subsequent step consists of testing the other qualitative predictors in the same way, to determine whether they also contribute significantly to the explanation of the newborn’s weight.

The following plots provide a detailed visualization of Ospedale:

grid.arrange(
  distr_peso_plot,
  peso_by_ospedale_box_plot,
  nrow = 1,
  ncol = 2
)

The distribution of Peso across the three hospitals appears very similar, suggesting that there are no statistically significant deviations in newborn weight among the hospitals.
The assumptions above for mean comparison are verified using a pairwise t-test, assessing both the normality and homoscedasticity hypotheses.
As the normality assumption for Peso was already confirmed in the previous section, it is only necessary to verify the homoscedasticity assumption by performing the Levene’s tes t.

leveneTest(Peso ~ Ospedale, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    2  0.5912 0.5537
##       2495

The Levene’s test indicate that the null hypotesys of equal variance \(H_0\) cannot be rejected \((p-value > 0.05)\).
Therefore, it can be assumed that the variances of Peso are homogeneous across the three hospitals, satisfying the homoscedasticity assumption required for the t-test.

The pairwise t-test concludes:

pairwise.t.test(data_set$Peso,
                data_set$Ospedale,
                p.adjust.method = 'bonferroni',
                paired = F,
                pool.sd = T)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data_set$Peso and data_set$Ospedale 
## 
##      osp1 osp2
## osp2 1.00 -   
## osp3 0.33 0.32
## 
## P value adjustment method: bonferroni

The pairwise t-test, adjusted using the Bonferroni correction, shows that no pairwise comparison among the three hospitals is statistically significant \((p-value > 0.05)\).
Therefore, it can be concluded that there are no significant differences in the mean birth weight of newborns across the hospitals, confirming that the Peso variable behaves consistently among the three medical centers.

Consequently the corresponding regression coefficient \(\beta\) is not expected to be significant.

The Bonferroni correction is a conservative adjustment method used to control the type-I error when multiple hypothesis tests are performed simultaneously.
Specifically, it divides the overall significance level (α) by the number of comparisons (m) to obtain a more stringent threshold for significance.

After evaluating the influence of the Ospedale variable on Peso, the analysis proceeds with the assessment of the remaining qualitative predictors: Tipo.parto, Fumatrici and N.gravidanze_class.

Both Tipo.parto and Fumatrici are binary and potentially relevant for explaining variations in newborn birth weight.

The same consideration applies to N.gravidaze_class whose grouping may help highlight potential relationships with Peso.

The objective of this analysis is to determine whether the delivery type (Tipo.parto) and maternal smoking status (Fumatrici) have a statistically significant impact on newborn weight.

# Peso comparison per N.gravidanze_class
peso_by_N.gravidanze_class_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "N.gravidanze_class",
  variable = "Peso",
  color = "lightblue",
  title = "Peso per N.gravidanze_class",
  ylab = "Peso [g]"
)

# Peso comparison per Fumatrici
peso_by_fumatrici_box_plot = box_plot_variable_by_cat(
  data_set = data_set,
  cat = "Fumatrici",
  variable = "Peso",
  color = "lightblue",
  title = "Peso per Fumatrici",
  ylab = "Peso [g]"
)


grid.arrange(
  distr_peso_plot,
  peso_by_tipo_parto_box_plot,
  nrow = 1,
  ncol = 2
)

grid.arrange(
  distr_peso_plot,
  peso_by_fumatrici_box_plot,
  nrow = 1,
  ncol = 2
)

grid.arrange(
  distr_peso_plot,
  peso_by_N.gravidanze_class_box_plot,
  nrow = 1,
  ncol = 2
)

From the graphical analysis, it appears that neither the type of delivery (Tipo.parto) nor maternal smoking (Fumatrici) nor N.gravidanze_class have a significant effect on the weight of the newborn (Peso).

The assumption above needs to be statistically verified using a two sample t-test validating before the homoscedasticity assumptions:

leveneTest(Peso ~ Tipo.parto, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)  
## group    1  4.5678 0.03267 *
##       2496                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveneTest(Peso ~ Fumatrici, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    1  1.4267 0.2324
##       2496

leveneTest(Peso ~ N.gravidanze_class, data = data_set)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    2  1.3904 0.2492
##       2495

The Levene’s test indicates that the homoscedasticity assumption holds for Fumatrici and N.gravidanze_class but not for Tipo.parto.

This means that the variance of Peso is homogeneous between smoking groups but significantly different between the two delivery types.

In this case the analysis will proceed applying the standard two sample t- test for Fumatrici and the Welch’s t-test for Tipo.parto.

t.test(Peso ~ Fumatrici,
       data = data_set,
       var.equal = T,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  Peso by Fumatrici
## t = 0.94878, df = 2496, p-value = 0.3428
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -53.24919 153.08153
## sample estimates:
## mean in group 0 mean in group 1 
##        3286.262        3236.346

t.test(Peso ~ Tipo.parto,
       data = data_set,
       var.equal = F,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Tipo.parto
## t = -0.13626, df = 1494.4, p-value = 0.8916
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
##  -46.44246  40.40931
## sample estimates:
## mean in group Ces mean in group Nat 
##          3282.047          3285.063

For the variable N.gravidanze_class a paired t-test, with Bonferroni correction, is applied to evaluate whether there are statistically significant differences in weight between the different classes of number of pregnancies.

pairwise.t.test(data_set$Peso,
                data_set$N.gravidanze_class,
                p.adjust.method = 'bonferroni',
                paired = F,
                pool.sd = T)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data_set$Peso and data_set$N.gravidanze_class 
## 
##       [0,1) [1,2)
## [1,2) 1.00  -    
## [2+)  0.93  0.79 
## 
## P value adjustment method: bonferroni

The test results indicate that the null hypothesis \(H_0\) cannot be rejected, meaning that there are no statistically significant differences in the birth weight of newborns based on either maternal smoking status, delivery type (caesarean versus natural) or Number of previous pregnancies.

Consequently the corresponding regression coefficients \(\beta\) are not expected to be significant.

Therefore, the grouped version of the variable is not retained in the final model, and the original numeric form of N.gravidanze is used instead.

data_set =  subset(data_set, select = -N.gravidanze_class)

At this stage a multiple regression model is constructed, initially incorporating all predictor variable to asses their joint influence on Peso.

mod_1 = lm(data_set$Peso~ ., data=data_set)
summary(mod_1)

## 
## Call:
## lm(formula = data_set$Peso ~ ., data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1123.26  -181.53   -14.45   161.05  2611.89 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6735.7960   141.4790 -47.610  < 2e-16 ***
## Anni.madre        0.8018     1.1467   0.699   0.4845    
## N.gravidanze     11.3812     4.6686   2.438   0.0148 *  
## Fumatrici1      -30.2741    27.5492  -1.099   0.2719    
## Gestazione       32.5773     3.8208   8.526  < 2e-16 ***
## Lunghezza        10.2922     0.3009  34.207  < 2e-16 ***
## Cranio           10.4722     0.4263  24.567  < 2e-16 ***
## Tipo.partoNat    29.6335    12.0905   2.451   0.0143 *  
## Ospedaleosp2    -11.0912    13.4471  -0.825   0.4096    
## Ospedaleosp3     28.2495    13.5054   2.092   0.0366 *  
## SessoM           77.5723    11.1865   6.934 5.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 668.7 on 10 and 2487 DF,  p-value: < 2.2e-16

The full regression model explains about the 73% of the variance in Peso (\(R^2=0.7278\)) indicating a strong overall fit.

The F-statistic is highly significant (p < 0.05), confirming the global validity of the model.

The most significatve predictors are Gestazione, Lunghezza, and Cranio, all positively associated with birth weight. Sesso, Tipo.parto and Ospedale are also significant, but with smaller contributions.

Anni.madre and Fumatrici show no significant effect, suggesting that they may be excluded in subsequent model refinements.

The next phase involves the iterative removal of non-significant predictors, starting with Anni.madre.

For each variable removed, an ANOVA F-test will be conducted to assess whether its exclusion leads to a statistically significant loss of explanatory power.

The process will continue recursively until the most parsimonious model with the best explanatory performance is achieved.

mod_2 = update(mod_1, ~. -Anni.madre)
summary(mod_2)

## 
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.82  -180.30   -16.22   160.66  2616.32 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6708.6189   136.0211 -49.320  < 2e-16 ***
## N.gravidanze     12.5833     4.3400   2.899  0.00377 ** 
## Fumatrici1      -30.4268    27.5455  -1.105  0.26944    
## Gestazione       32.2996     3.7997   8.501  < 2e-16 ***
## Lunghezza        10.2916     0.3008  34.209  < 2e-16 ***
## Cranio           10.4874     0.4257  24.638  < 2e-16 ***
## Tipo.partoNat    29.6654    12.0892   2.454  0.01420 *  
## Ospedaleosp2    -10.9509    13.4442  -0.815  0.41541    
## Ospedaleosp3     28.5171    13.4986   2.113  0.03474 *  
## SessoM           77.6452    11.1849   6.942 4.91e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2488 degrees of freedom
## Multiple R-squared:  0.7288, Adjusted R-squared:  0.7279 
## F-statistic: 743.1 on 9 and 2488 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_1, model_2 = mod_2)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## Model 2: data_set$Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   2488 186779904                           
## 2   2487 186743194  1     36710 0.4889 0.4845
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35215.45 35145.57
## 2      New Model 35208.12 35144.06

The ANOVA test demostreates that Anni.madre not reduce the power of the model to predict the newborn weight.

This new model mod_2 maintains the same capacity to explain about the 73% of the variance in Peso (\(R^2=0.7278\)) as mod_1.

The F-statistic is highly significant (p < 0.05), confirming the global validity of the model, and the BIC and AIC metrics are smaller than those of the previous model. It follows that the new model mod_2 is better than the previous one.

Additionally the variable N.gravidanze increase its significance level respect to mod_1.

mod_3 = update(mod_2, ~. -Fumatrici)
summary(mod_3)

## 
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso, data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.07  -181.71   -16.66   161.08  2619.57 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6707.9252   136.0257 -49.314  < 2e-16 ***
## N.gravidanze     12.3360     4.3344   2.846  0.00446 ** 
## Gestazione       32.0386     3.7925   8.448  < 2e-16 ***
## Lunghezza        10.3059     0.3006  34.286  < 2e-16 ***
## Cranio           10.4920     0.4257  24.648  < 2e-16 ***
## Tipo.partoNat    29.4080    12.0875   2.433  0.01505 *  
## Ospedaleosp2    -10.8939    13.4447  -0.810  0.41786    
## Ospedaleosp3     28.7917    13.4969   2.133  0.03301 *  
## SessoM           77.4657    11.1842   6.926 5.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2489 degrees of freedom
## Multiple R-squared:  0.7287, Adjusted R-squared:  0.7278 
## F-statistic: 835.7 on 8 and 2489 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_2, model_2 = mod_3)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   2489 186871503                           
## 2   2488 186779904  1     91599 1.2201 0.2694
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35208.12 35144.06
## 2      New Model 35201.52 35143.29

The ANOVA test demostreates that Fumatrici not reduce the power of the model to predict the newborn weight and the BIC and AIC metrics are smaller than those of the previous model. It follows that the new model mod_3 is better than the previous one.

This new model, mod_3, maintains the same capacity to explain about the 73% of the variance in Peso (\(R^2=0.7278\)) as mod_2.

mod_4 = update(mod_3, ~. -Ospedale)
summary(mod_4)

## 
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Sesso, data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1129.14  -181.97   -16.26   160.95  2638.18 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6708.0171   136.0715 -49.298  < 2e-16 ***
## N.gravidanze     12.7356     4.3385   2.935  0.00336 ** 
## Gestazione       32.3253     3.7969   8.514  < 2e-16 ***
## Lunghezza        10.2833     0.3009  34.177  < 2e-16 ***
## Cranio           10.5063     0.4263  24.648  < 2e-16 ***
## Tipo.partoNat    30.1601    12.1027   2.492  0.01277 *  
## SessoM           77.9171    11.1994   6.957 4.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2491 degrees of freedom
## Multiple R-squared:  0.7277, Adjusted R-squared:  0.727 
## F-statistic:  1109 on 6 and 2491 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_3, model_2 = mod_4)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
##   Res.Df       RSS Df Sum of Sq      F   Pr(>F)   
## 1   2491 187574428                                
## 2   2489 186871503  2    702925 4.6812 0.009349 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35201.52 35143.29
## 2      New Model 35195.25 35148.67

Although Ospedale appeared to be slightly statistically significant in the model mod_3, it was removed because the previous analyses indicated that this variable does not have a meaningful influence on Peso.

The same reasoning also applies to Tipo.parto and N.gravidanze, which was therefore excluded from the model.

mod_5 = update(mod_4, ~. -Tipo.parto)
summary(mod_5)

## 
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + 
##     Cranio + Sesso, data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1149.37  -180.98   -15.57   163.69  2639.09 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6681.7251   135.8036 -49.201  < 2e-16 ***
## N.gravidanze    12.4554     4.3416   2.869  0.00415 ** 
## Gestazione      32.3827     3.8008   8.520  < 2e-16 ***
## Lunghezza       10.2455     0.3008  34.059  < 2e-16 ***
## Cranio          10.5410     0.4265  24.717  < 2e-16 ***
## SessoM          77.9807    11.2111   6.956 4.47e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7265 
## F-statistic:  1327 on 5 and 2492 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_4, model_2 = mod_5)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Sesso
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1   2492 188042054                              
## 2   2491 187574428  1    467626 6.2101 0.01277 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35195.25 35148.67
## 2      New Model 35193.65 35152.89

mod_6 = update(mod_5, ~. -N.gravidanze)
summary(mod_6)

## 
## Call:
## lm(formula = data_set$Peso ~ Gestazione + Lunghezza + Cranio + 
##     Sesso, data = data_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.1  -184.4   -17.4   163.3  2626.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.6732   135.5952 -49.055  < 2e-16 ***
## Gestazione     31.3262     3.7884   8.269  < 2e-16 ***
## Lunghezza      10.2024     0.3009  33.909  < 2e-16 ***
## Cranio         10.6706     0.4247  25.126  < 2e-16 ***
## SessoM         79.1027    11.2205   7.050 2.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1652 on 4 and 2493 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_5, model_2 = mod_6)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso
##   Res.Df       RSS Df Sum of Sq      F   Pr(>F)   
## 1   2493 188663107                                
## 2   2492 188042054  1    621053 8.2304 0.004154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35193.65 35152.89
## 2      New Model 35194.06 35159.12

The model mod_6 appears to be the most appropriate specification.

Despite using a smaller number of predictors, it maintains nearly the same level of explained variance as the previous model about 73%, while showing an improved F-statistic with a p-value < 0.05 and a BIC value almost identical to that of the previous model mod_5.

This indicates that the reduced model achieves a more efficient balance between parsimony and explanatory power.

As previously explained the variable Gestazione increases, demostrating an increasing asymptotic effect that could be explained by including the logarithmic contribution of Gestazione in the model.

mod_7 = update(mod_6, ~. +I(log(Gestazione)))
summary(mod_7)

## 
## Call:
## lm(formula = data_set$Peso ~ Gestazione + Lunghezza + Cranio + 
##     Sesso + I(log(Gestazione)), data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1141.78  -181.60   -14.94   162.19  2652.02 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4651.2180  4258.5700   1.092 0.274850    
## Gestazione           148.5229    44.2956   3.353 0.000811 ***
## Lunghezza             10.3199     0.3038  33.975  < 2e-16 ***
## Cranio                10.7828     0.4263  25.296  < 2e-16 ***
## SessoM                76.6305    11.2455   6.814 1.18e-11 ***
## I(log(Gestazione)) -4360.2156  1641.9597  -2.655 0.007970 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.8 on 2492 degrees of freedom
## Multiple R-squared:  0.7269, Adjusted R-squared:  0.7263 
## F-statistic:  1326 on 5 and 2492 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_6, model_2 = mod_7)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso + I(log(Gestazione))
## Model 2: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)   
## 1   2492 188130750                               
## 2   2493 188663107 -1   -532357 7.0517 0.00797 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35194.06 35159.12
## 2      New Model 35194.83 35154.06

This contribution does not improve the model. The logarithmic contribution is less significant than the linear one; therefore, the mod_6 model is still considered the best.

To must to be sure to cover all the possibility a new model called mod_8 is created using the automatic step wise procedure that minimizes the BIC.

This criterion is preferred because it applies a stricter penalty to models with a greater number of parameters, ensuring simplicity without compromising explanatory accuracy.

mod_8 = stepAIC(mod_1, direction = 'both', k=log(n))

## Start:  AIC=28118.61
## data_set$Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1     36710 186779904 28111
## - Fumatrici     1     90677 186833870 28112
## - Ospedale      2    687555 187430749 28112
## - N.gravidanze  1    446244 187189438 28117
## - Tipo.parto    1    451073 187194266 28117
## <none>                      186743194 28119
## - Sesso         1   3610705 190353899 28159
## - Gestazione    1   5458852 192202046 28183
## - Cranio        1  45318506 232061700 28654
## - Lunghezza     1  87861708 274604902 29074
## 
## Step:  AIC=28111.28
## data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Fumatrici     1     91599 186871503 28105
## - Ospedale      2    693914 187473818 28105
## - Tipo.parto    1    452049 187231953 28110
## <none>                      186779904 28111
## - N.gravidanze  1    631082 187410986 28112
## + Anni.madre    1     36710 186743194 28119
## - Sesso         1   3617809 190397713 28151
## - Gestazione    1   5424800 192204704 28175
## - Cranio        1  45569477 232349381 28649
## - Lunghezza     1  87852027 274631931 29066
## 
## Step:  AIC=28104.68
## data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Ospedale      2    702925 187574428 28098
## - Tipo.parto    1    444404 187315907 28103
## <none>                      186871503 28105
## - N.gravidanze  1    608136 187479640 28105
## + Fumatrici     1     91599 186779904 28111
## + Anni.madre    1     37633 186833870 28112
## - Sesso         1   3601860 190473363 28145
## - Gestazione    1   5358199 192229702 28168
## - Cranio        1  45613331 232484834 28642
## - Lunghezza     1  88259386 275130889 29063
## 
## Step:  AIC=28098.41
## data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Tipo.parto    1    467626 188042054 28097
## <none>                      187574428 28098
## - N.gravidanze  1    648873 188223301 28099
## + Ospedale      2    702925 186871503 28105
## + Fumatrici     1    100610 187473818 28105
## + Anni.madre    1     44184 187530244 28106
## - Sesso         1   3644818 191219246 28139
## - Gestazione    1   5457887 193032315 28162
## - Cranio        1  45747094 233321522 28636
## - Lunghezza     1  87955701 275530129 29051
## 
## Step:  AIC=28096.81
## data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      188042054 28097
## - N.gravidanze  1    621053 188663107 28097
## + Tipo.parto    1    467626 187574428 28098
## + Ospedale      2    726146 187315907 28103
## + Fumatrici     1     92548 187949505 28103
## + Anni.madre    1     45366 187996688 28104
## - Sesso         1   3650790 191692844 28137
## - Gestazione    1   5477493 193519547 28161
## - Cranio        1  46098547 234140601 28637
## - Lunghezza     1  87532691 275574744 29044

summary(mod_8)

## 
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + 
##     Cranio + Sesso, data = data_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1149.37  -180.98   -15.57   163.69  2639.09 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6681.7251   135.8036 -49.201  < 2e-16 ***
## N.gravidanze    12.4554     4.3416   2.869  0.00415 ** 
## Gestazione      32.3827     3.8008   8.520  < 2e-16 ***
## Lunghezza       10.2455     0.3008  34.059  < 2e-16 ***
## Cranio          10.5410     0.4265  24.717  < 2e-16 ***
## SessoM          77.9807    11.2111   6.956 4.47e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7265 
## F-statistic:  1327 on 5 and 2492 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_6, model_2 = mod_8)

## ANOVA F-test between the models:
## Analysis of Variance Table
## 
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso
## Model 2: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
##   Res.Df       RSS Df Sum of Sq      F   Pr(>F)   
## 1   2492 188042054                                
## 2   2493 188663107 -1   -621053 8.2304 0.004154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35194.06 35159.12
## 2      New Model 35193.65 35152.89

The automatic stepwise procedure returns a model that includes the same predictors as mod_5, which was previously discarded due to its very low correlation with Peso. Therefore, these results confirm that mod_6 represents the most appropriate and statistically robust model specification.

Model Quality Analysis

summary(mod_6)

## 
## Call:
## lm(formula = data_set$Peso ~ Gestazione + Lunghezza + Cranio + 
##     Sesso, data = data_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.1  -184.4   -17.4   163.3  2626.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.6732   135.5952 -49.055  < 2e-16 ***
## Gestazione     31.3262     3.7884   8.269  < 2e-16 ***
## Lunghezza      10.2024     0.3009  33.909  < 2e-16 ***
## Cranio         10.6706     0.4247  25.126  < 2e-16 ***
## SessoM         79.1027    11.2205   7.050 2.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1652 on 4 and 2493 DF,  p-value: < 2.2e-16

For mod_6, a multicollinearity analysis is first performed among the regressors.

vif(mod_6)

## Gestazione  Lunghezza     Cranio      Sesso 
##   1.654101   2.070582   1.606316   1.038918

The results does not show evidence of multicollinearity among the predictors, all VIF values are below the critical threshold of 5.

The next step is to analyse the outliers and residuals:

par(mfrow = c(2, 2))
plot(mod_6)

The following table provide a summary on the diagnostic plots:

Plot name	Remarks
Residual vs Fitted	The residuals seem more or less distributed around 0, which suggests that there are no problems related to non-linearity.
Normal Q-Q	The points are mostly lined up along the diagonal line but not in the tails, which suggests that the normality assumption may be isn’t met.
Scale-Location	The dispersion of the residuals does not appear to increase systematically with the fitted values, but the pattern seems slightly shifted. This suggests that the homoscedasticity assumption may not be fully satisfied.
Residuals vs Leverage	The presence of a few observations with high leverage suggests the need for further investigation into their potential impact on model stability and coefficient estimates.

To provide a strict statistical validation on the descriptions above the following assumption are validated:

residuals normality assumption
residuals homoscedasticity assumption
residuals no-correlation assumptions

shapiro.test(residuals(mod_6))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod_6)
## W = 0.97426, p-value < 2.2e-16

bptest(mod_6)

## 
##  studentized Breusch-Pagan test
## 
## data:  mod_6
## BP = 89.198, df = 4, p-value < 2.2e-16

dwtest(mod_6)

## 
##  Durbin-Watson test
## 
## data:  mod_6
## DW = 1.9553, p-value = 0.1318
## alternative hypothesis: true autocorrelation is greater than 0

The unique assumption validated is the residulas no-correlation. At this stage it is important focusing the analysis on leverage points and outliers.

leverage_plot(model = mod_6, number_of_rows = n)

outliers_analysis(mod_6)

##      rstudent unadjusted p-value Bonferroni p
## 1549 9.980701         4.9791e-23   1.2438e-19
## 155  4.948947         7.9596e-07   1.9883e-03
## 1305 4.779033         1.8638e-06   4.6558e-03

## [1] "max cook distance"
## [1] 0.9780498

The outlier test highlights a few observations that may disproportionately affect the model.
Their exclusion allows evaluating whether the resulting model provides a better representation of the data.

outlier_index =  c(1549, 155, 1305)
data_set_no_outliers <- data_set[-outlier_index, ]
final_model = lm(Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
                      data = data_set_no_outliers)
summary(final_model)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
##     data = data_set_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1155.47  -180.59   -14.36   163.23  1131.14 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6662.1067   131.8102 -50.543  < 2e-16 ***
## Gestazione     28.0935     3.6904   7.613 3.79e-14 ***
## Lunghezza      10.9552     0.2997  36.549  < 2e-16 ***
## Cranio          9.9707     0.4171  23.903  < 2e-16 ***
## SessoM         78.7775    10.9057   7.224 6.70e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.2 on 2490 degrees of freedom
## Multiple R-squared:  0.7405, Adjusted R-squared:  0.7401 
## F-statistic:  1776 on 4 and 2490 DF,  p-value: < 2.2e-16

models_comparison(model_1 = mod_6, model_2 = final_model, perform_anova_test = F)

## Models BIC and AIC:
##            Model      BIC      AIC
## 1 Previous Model 35194.06 35159.12
## 2      New Model 35006.04 34971.11

The regression model re-estimated after excluding the influential observations demonstrates improved performance compared to the initial specification.
All diagnostic indicators including the adjusted \(R^2\), F-statistic, and residual standard error show favorable changes, confirming that the removal of outliers enhances the model’s explanatory power and overall reliability.

par(mfrow = c(2, 2))
plot(final_model)

and Executing the residuals and outliers test:

shapiro.test(residuals(final_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(final_model)
## W = 0.99189, p-value = 1.31e-10

bptest(final_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  final_model
## BP = 7.0521, df = 4, p-value = 0.1332

dwtest(final_model)

## 
##  Durbin-Watson test
## 
## data:  final_model
## DW = 1.9553, p-value = 0.132
## alternative hypothesis: true autocorrelation is greater than 0

outliers_analysis(final_model)

##       rstudent unadjusted p-value Bonferroni p
## 1397 -4.345182         1.4474e-05     0.036113

## [1] "max cook distance"
## [1] 0.07703717

The only assumption that is not fully validated concerns the normality of the residuals.
However, the residual density function shows a distribution reasonably close to normality, suggesting that this deviation does not substantially affect the model’s validity.
Moreover, Cook’s distances have significantly decreased in the refined model, with a maximum value of 0.077, well below the conventional threshold of concern. Therefore, the identified outliers can be reasonably disregarded, and the current model, final_model can be retained as the best-fitting and most robust version.

3. Predictions and Results

Once the model is validated, we will use it to make practical predictions. For example, we can estimate the weight of a female newborn considering a mother in her third pregnancy who will give birth at 39 weeks.

The model includes the variables Cranio and Lunghezza, which represent the main anthropometric predictors of birth weight but not the number of previous pregnancies N.gravidanze.
For the predictive estimation, the average values of these two variables are used, as they correspond to typical neonatal measurements within the analyzed sample.

case_1 = data.frame(
  Gestazione = 39,
  Lunghezza = mean(data_set_no_outliers$Lunghezza),
  Cranio = mean(data_set_no_outliers$Cranio),
  Sesso = "F")

predict(final_model,
        newdata = case_1,
        interval = "confidence")

##        fit      lwr      upr
## 1 3244.307 3229.361 3259.254

Using the validated regression model (final_model), the predicted birth weight for a female newborn at the 39th gestational week, with average anthropometric characteristics (mean skull diameter and body length), is 3244.3 grams.
The 95% confidence interval ranges from 3229.4 g to 3259.3 g, indicating a narrow uncertainty band and suggesting that the model provides a highly stable prediction around the expected mean value.

4. Visualizzations

This section communicate the model’s results and the most significant relationships between variables using the following visual representations.

model_graphs_plot(data_set = data_set_no_outliers,
                  var = "Gestazione",
                  resp = "Peso",
                  cat = "Sesso",
                  title = "Peso per Gestazione by Sesso")

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The graph clearly shows a positive, nearly linear relationship between gestation (Gestazione) and weight (Peso), broken down by gender (female and male).

The regression lines (one for each group) are almost parallel, suggesting that:

The trend of weight growth with gestational length is consistent for both genders.
Males tend to be slightly heavier than females at the same gestational age, which is consistent with the findings of the statistical tests and the linear model.

model_graphs_plot(data_set = data_set_no_outliers,
                  var = "Gestazione",
                  resp = "Peso",
                  cat = "Fumatrici",
                  title = "Peso per Gestazione by Fumatrici")

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The graph shows that the relationship between gestational age (Gestazione) and birth weight is positive and almost linear for both categories of mother, regardless of whether they smoke (0 = no-smoker, 1 = smoker). The two lines almost overlap, suggesting that smoking has no statistically significant effect on birth weight.

However, the greater proportion of smokers compared to non-smokers in the sample reduces the statistical power of the test and increases the uncertainty in the estimate. A more reliable assessment would require a more balanced sample.