1 Scope and Goals of the Analysis

At Neonatal Health Solutions, our goal is to enhance the quality of prenatal and neonatal care through data-driven insights.

This analysis aims to develop a predictive model for newborn birth weight, leveraging clinical and demographic variables collected from three hospitals. Specifically, the analysis aims to:

  • Improve clinical predictions of neonatal outcomes to enable timely interventions.

  • Optimize hospital resources by anticipating which newborns may require intensive care.

  • Identify and prevent risk factors affecting birth weight and pregnancy outcomes.

  • Evaluate hospital practices to ensure consistent and high-quality care across centers.

  • Support strategic planning through data-driven insights that enhance neonatal health policies.

The project represents a step toward our broader mission: using statistical modeling to promote healthier births and more informed medical practices.

2 Dataset Description and Variable Analysis

The dataset used for this analysis contains information on 2,500 newborns collected from three hospitals. It includes both maternal and neonatal clinical variables that may influence birth weight.

The variables include maternal age, number of previous pregnancies, maternal smoking status, gestational duration, newborn weight, length, and head circumference, type of delivery, hospital of birth, and sex of the newborn.

The following table shows the first few rows of the dataset, providing a quick glimpse of the recorded variables.

Table 1. Dataset Preview
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
26 0 0 42 3380 490 325 Nat osp3 M
21 2 0 39 3150 490 345 Nat osp1 F
34 3 0 38 3640 500 375 Nat osp2 M
28 1 0 41 3690 515 365 Nat osp2 M
20 0 0 38 3700 480 335 Nat osp3 F
32 0 0 40 3200 495 340 Nat osp2 F

Below is a detailed description of each variable, including its type and statistical classification.

Table 2. Variable Description
Variable Name Meaning Classification
Anni.madre Mother's age Quantitative discrete
N.gravidanze Number of previous pregnancies Quantitative discrete
Fumatrici Mother's smoking status (0 = no, 1 = yes) Qualitative nominal
Gestazione Weeks of gestation Quantitative discrete
Peso Newborn's weight in grams Quantitative continuous
Lunghezza Newborn's length in millimeters Quantitative continuous
Cranio Newborn's head circumference in millimeters Quantitative continuous
Tipo.parto Type of delivery (natural or cesarean) Qualitative nominal
Ospedale Hospital code Qualitative nominal
Sesso Newborn's sex (M/F) Qualitative nominal

3 Descriptive Statistics

This chapter summarizes the key statistical properties of both quantitative and qualitative variables in the dataset. For quantitative variables, measures of position, variability, and shape are calculated. For qualitative variables, the frequency distribution of observations across categories is examined.

The following table shows an overview of the quantitative variables in the dataset.

Table 3. Quantitative Variables in the Dataset
Anni.madre N.gravidanze Gestazione Peso Lunghezza Cranio
26 0 42 3380 490 325
21 2 39 3150 490 345
34 3 38 3640 500 375
28 1 41 3690 515 365
20 0 38 3700 480 335
32 0 40 3200 495 340

3.1 Position Indexes

Below is a summary of the main position indexes for the quantitative variables.

Table 4. Position Indexes
Variable Mean Mode Min Q1 Median Q3 Max
Anni.madre 28.16 30 0.00 25.00 28.00 32.00 46.00
N.gravidanze 0.98 0 0.00 0.00 1.00 1.00 12.00
Gestazione 38.98 40 25.00 38.00 39.00 40.00 43.00
Peso 3,284.08 3300 830.00 2,990.00 3,300.00 3,620.00 4,930.00
Lunghezza 494.69 500 310.00 480.00 500.00 510.00 565.00
Cranio 340.03 340 235.00 330.00 340.00 350.00 390.00

The summary highlights several points of interest.

The minimum maternal age is recorded as 0 years, which is clearly an error and will need correction. The mean maternal age is 28.16 years, while the most common age (mode) is 30 years. About 25% of mothers are 25 years old or less, while 75% are 32 years old or less. The oldest recorded mother is 46 years old.

The number of pregnancies ranges from 0 to 12, suggesting the presence of exceptional cases. The mode is 0, indicating that the single most common situation is a mother with no previous pregnancies. Additionally, at least 25% of women report no previous pregnancies, while at least 75% report one or fewer.

The gestational age has a mode of 40 weeks, which aligns with the typical full-term duration. Values below 37 weeks may indicate premature births, with the lowest gestational age recorded being 25 weeks.

The newborn’s weight ranges from 830 g to 4,930 g. The mean weight is 3,284 g, while the median is 3,300 g. The minimum value likely reflects a severely premature birth.

Length ranges from 310 mm to 565 mm, and head circumference from 235 mm to 390 mm, with the very low values probably reflecting premature births or other atypical cases.
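As a minimal sketch of how such position indexes can be computed, the Python snippet below uses only the standard library on the six Peso values shown in the dataset preview (Table 1). It is illustrative only: the figures in Table 4 come from the full 2,500-row dataset and from the report's own R code, which is not shown. The names `peso_preview` and `position_indexes` are hypothetical, and the `inclusive` quantile method is assumed to correspond to R's default interpolation.

```python
import statistics

# Peso (grams) for the six preview rows of Table 1; NOT the full dataset.
peso_preview = [3380, 3150, 3640, 3690, 3700, 3200]


def position_indexes(values):
    """Return mean, min, quartiles, and max for a list of numbers."""
    # method="inclusive" treats the data as containing the population
    # minimum and maximum (assumed here to match R's default quantile type).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


summary = position_indexes(peso_preview)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With only six rows the quartiles are interpolated between neighbouring observations, so they differ from the full-sample values in Table 4.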

3.2 Variability Indexes

Below is a summary of the main variability indexes for the quantitative variables.

Table 5. Variability Indexes
Variable StdDev Range IQR CV
Anni.madre 5.27 46.00 7.00 18.72%
N.gravidanze 1.28 12.00 1.00 130.51%
Gestazione 1.87 18.00 2.00 4.79%
Peso 525.04 4,100.00 630.00 15.99%
Lunghezza 26.32 255.00 30.00 5.32%
Cranio 16.43 155.00 20.00 4.83%

The Coefficient of Variation (CV) provides a standardized measure of dispersion, allowing comparisons of variability across variables with different scales.

Among the variables, number of pregnancies (CV = 130.5%) shows by far the highest relative variability, reflecting a highly dispersed distribution affected by several extreme values. Maternal age (CV = 18.7%) and newborn weight (CV = 16.0%) display moderate variability, suggesting some heterogeneity within the sample.

In contrast, gestational age (CV = 4.8%), length (CV = 5.3%), and head circumference (CV = 4.8%) exhibit low relative variability, indicating that these measures are relatively less dispersed across observations.
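The CV values can be re-derived from the published means and standard deviations. The sketch below does this in Python with the rounded figures from Tables 4 and 5, so small discrepancies in the last digit with respect to Table 5 are expected. The function name `coefficient_of_variation` is a hypothetical helper, not part of the report's code.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percentage: 100 * standard deviation / |mean|."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * std_dev / abs(mean)


# (standard deviation, mean) pairs copied from Table 5 and Table 4.
reported = {
    "Anni.madre": (5.27, 28.16),
    "N.gravidanze": (1.28, 0.98),
    "Gestazione": (1.87, 38.98),
    "Peso": (525.04, 3284.08),
    "Lunghezza": (26.32, 494.69),
    "Cranio": (16.43, 340.03),
}

cv_by_variable = {
    name: coefficient_of_variation(sd, mean)
    for name, (sd, mean) in reported.items()
}

for name, cv in cv_by_variable.items():
    print(f"{name}: {cv:.2f}%")
```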

3.3 Shape Indexes

Below is a summary of the main shape indexes for the quantitative variables.

Table 6. Shape Indexes
Variable Skewness Kurtosis
Anni.madre 0.04 0.38
N.gravidanze 2.51 10.99
Gestazione −2.07 8.26
Peso −0.65 2.03
Lunghezza −1.51 6.49
Cranio −0.79 2.95

The shape analysis highlights differences in symmetry and tail behavior among the variables.

Maternal age has near-zero skewness (0.04) and a slightly positive kurtosis (0.38), indicating an almost symmetric and mildly leptokurtic distribution.

Number of pregnancies shows strong positive skewness (2.51) and high positive kurtosis (10.99), suggesting a right-skewed and sharply leptokurtic distribution, with most women having few pregnancies and a small number with very high counts.

Gestational age displays negative skewness (−2.07) and high positive kurtosis (8.26), meaning it is left-skewed and leptokurtic, with most pregnancies concentrated around full term and a few markedly shorter durations.

Newborn weight, length, and head circumference all exhibit negative skewness (−0.65, −1.51, and −0.79 respectively) and positive kurtosis (2.03, 6.49, and 2.95 respectively), indicating asymmetric distributions with longer left tails and peaked shapes relative to the normal distribution.
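For readers who want to see how such shape indexes are defined, the following Python sketch computes moment-based skewness and excess kurtosis. Two assumptions apply: the report does not state which estimator was used for Table 6 (a bias-corrected variant would give slightly different numbers), and the kurtosis values are taken to be excess kurtosis (normal distribution = 0), as the interpretation of 0.38 as "mildly leptokurtic" suggests. The function `shape_indexes` and the toy data are hypothetical.

```python
import statistics


def shape_indexes(values):
    """Return (skewness, excess kurtosis) based on central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("shape indexes are undefined for constant data")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Toy data: a symmetric sample and a right-skewed count-like sample.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [0, 0, 0, 1, 12]

print(shape_indexes(symmetric))
print(shape_indexes(right_skewed))
```

A positive first value indicates a longer right tail (as for N.gravidanze), a negative one a longer left tail (as for Gestazione).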

3.4 Outliers Analysis

Outlier detection is an essential step to identify unusual or extreme observations that may affect the accuracy of the statistical analysis and model estimation. These values can result from data entry errors, measurement inaccuracies, or genuine but rare clinical cases.

Below is a boxplot representation for all quantitative variables, where observations identified as outliers are highlighted in red. This visualization provides an overview of the variability within each variable and helps assess the presence and extent of extreme values in the dataset.

Below is a summary table reporting, for each quantitative variable, the number and values of detected outliers.

Table 7. Outliers summary
Variable Outliers_count Outliers_values
Anni.madre 13 0, 1, 13, 14, 43, 44, 45, 46
N.gravidanze 246 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Gestazione 67 25, 26, 27, 28, 29, 30, 31, 32, 33, 34
Peso 69 830, 900, 930, 980, 990, 1140, 1170, 1180, 1190, 1280, 1285, 1300, 1340, 1370, 1390, 1410, 1430, 1450, 1500, 1550, 1560, 1580, 1600, 1615, 1620, 1690, 1720, 1730, 1750, 1770, 1780, 1800, 1840, 1850, 1890, 1900, 1950, 1960, 1970, 1980, 2000, 2040, 4580, 4600, 4620, 4650, 4680, 4690, 4700, 4720, 4760, 4810, 4900, 4930
Lunghezza 59 310, 315, 320, 325, 340, 345, 355, 360, 370, 380, 385, 390, 400, 405, 410, 420, 425, 430, 560, 565
Cranio 48 235, 245, 253, 254, 265, 266, 267, 270, 272, 273, 274, 275, 276, 277, 278, 280, 285, 287, 289, 290, 292, 293, 294, 295, 297, 298, 299, 381, 382, 383, 384, 385, 386, 390

The table highlights several notable aspects. Maternal age shows 13 outliers, including implausible values such as 0 and 1, which most likely result from data entry errors.

Number of pregnancies presents a large number of outliers (246 cases), primarily because the interquartile range is only 1 (Q1 = 0, Q3 = 1). Under the standard 1.5 × IQR boxplot rule, which is consistent with the values in Table 7, the upper fence is 2.5, so any value of 3 or above is statistically classified as an outlier.

Gestational age includes 67 outliers, corresponding to early deliveries between 25 and 34 weeks.

For newborn weight, 69 outliers were identified, concentrated at both extremes: from 830 g to 2,040 g and from 4,580 g to 4,930 g. These observations likely reflect premature or macrosomic births rather than data errors.

Length and head circumference also display several outliers, particularly at the tails of their distributions, possibly related to the same clinical conditions influencing birth weight.

Overall, while some outliers appear to stem from recording inaccuracies, others represent genuine biological variability. Based on these considerations, maternal age values equal to 0 or 1 have been recoded as NA for data consistency.
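The outlier thresholds can be reproduced from the quartiles in Table 4. The sketch below assumes the standard boxplot rule (values beyond Q1 − 1.5 × IQR or Q3 + 1.5 × IQR), which the report does not state explicitly but which matches the values listed in Table 7. It also shows one way to recode implausible maternal ages as missing. The helper names `tukey_fences` and `recode_implausible_ages` are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR boxplot rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (Q1, Q3) pairs copied from Table 4.
quartiles = {
    "Anni.madre": (25, 32),
    "N.gravidanze": (0, 1),
    "Gestazione": (38, 40),
    "Peso": (2990, 3620),
    "Lunghezza": (480, 510),
    "Cranio": (330, 350),
}

fences = {name: tukey_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, (low, high) in fences.items():
    print(f"{name}: outlier if < {low} or > {high}")


def recode_implausible_ages(ages, implausible=(0, 1)):
    """Replace implausible maternal ages with None (the analogue of NA)."""
    return [None if age in implausible else age for age in ages]


print(recode_implausible_ages([26, 0, 34, 1]))
```

For example, the upper fence of 2.5 for N.gravidanze explains why every value from 3 upward appears in Table 7.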

4 Frequency distribution

In this section, we examine the distribution of each quantitative variable through three complementary perspectives.

First, the frequency class distribution summarizes how observations are distributed across intervals, allowing the identification of potential asymmetries or clustering of values.

Next, the density plots provide a visual representation of the continuous distribution of each variable, helping to assess the overall shape and detect deviations from normality such as skewness or heavy tails.

Finally, the Shapiro–Wilk test formally evaluates the assumption of normality for each variable.
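As an illustration of the first step, a frequency class distribution can be built by assigning each observation to a fixed-width interval and counting. The Python sketch below does this for the six Peso values of the dataset preview with a hypothetical class width of 500 g; the class width and the function name `class_frequencies` are assumptions, since the report does not state how its classes were defined.

```python
from collections import Counter

# Peso (grams) for the six preview rows of Table 1; NOT the full dataset.
peso_preview = [3380, 3150, 3640, 3690, 3700, 3200]


def class_frequencies(values, width):
    """Map each class [lower, lower + width) to (count, relative frequency)."""
    counts = Counter((v // width) * width for v in values)
    n = len(values)
    return {
        (lower, lower + width): (count, count / n)
        for lower, count in sorted(counts.items())
    }


table = class_frequencies(peso_preview, width=500)
for (lower, upper), (count, rel) in table.items():
    print(f"[{lower}, {upper}): n = {count}, relative frequency = {rel:.2f}")
```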

4.1 Variable weight

## 
##  Shapiro-Wilk normality test
## 
## data:  df$Peso
## W = 0.97066, p-value < 2.2e-16

The Shapiro–Wilk test for the variable Peso (newborn weight) returned a p-value < 2.2×10⁻¹⁶, far below the 0.05 threshold. Although the null hypothesis of normality is statistically rejected, this result should be interpreted with caution given the large sample size. With 2,500 observations, even small deviations from normality tend to yield significant p-values. The slight departure from normality is also consistent with the mild negative skewness observed earlier, likely due to a subset of low-birth-weight newborns.

4.2 Variable mother’s age

## 
##  Shapiro-Wilk normality test
## 
## data:  df$Anni.madre
## W = 0.99491, p-value = 1.477e-07

4.3 Weeks of Gestation

## 
##  Shapiro-Wilk normality test
## 
## data:  df$Gestazione
## W = 0.83328, p-value < 2.2e-16

4.4 Variable Length

## 
##  Shapiro-Wilk normality test
## 
## data:  df$Lunghezza
## W = 0.90941, p-value < 2.2e-16

4.5 Head’s circumference

## 
##  Shapiro-Wilk normality test
## 
## data:  df$Cranio
## W = 0.96357, p-value < 2.2e-16

Overall, none of the quantitative variables pass the Shapiro–Wilk normality test. However, this result should be interpreted with caution, as the large number of observations can strongly influence the test outcome, making even small deviations from normality appear statistically significant. This limitation is well-documented in statistical literature [1].

4.6 Categorical variables

The figure below displays the relative frequencies of three of the categorical variables in the dataset: Sesso (newborn’s sex), Fumatrici (maternal smoking status), and Ospedale (hospital ID). These distributions provide an overview of how observations are divided among categories, offering useful context for understanding group proportions.

There is a clear disparity, with non-smoking mothers representing over 96% of the sample. This suggests that smoking during pregnancy is relatively rare within this dataset.

The births appear to be almost evenly distributed among the three hospitals, with a slightly higher proportion recorded in Ospedale 3. This indicates that the data collection is reasonably balanced across institutions.

The distribution between male and female newborns is nearly balanced, indicating no substantial sex bias in the sample.

5 Statistical Tests

In this section, we conduct a series of statistical tests to validate specific hypotheses derived from the study objectives. In particular, we assess:

  1. Whether the proportion of cesarean deliveries differs significantly across hospitals;

  2. Whether the mean weight and length of newborns in this sample are statistically equal to those observed in the general population;

  3. Whether anthropometric measures (weight, length, and head circumference) differ significantly between male and female newborns.

Each test is performed using the most appropriate statistical method, depending on the type and distribution of the variables involved.

5.1 Differences in Cesarean Delivery Rates Across Hospitals

The first hypothesis evaluates whether the three hospitals in the dataset differ in the proportion of cesarean deliveries. To investigate this, we examine the association between delivery type (Tipo.parto) and hospital (Ospedale) using the chi-square test of independence.

This statistical test assesses whether two categorical variables are independent—meaning that the distribution of one variable does not change across the levels of the other. In this context, it verifies whether the frequency of cesarean versus natural births is consistent across all hospitals, or whether certain hospitals perform cesarean deliveries more or less frequently than others.

The hypotheses of the chi-square test are:

Null hypothesis (H₀):
The variables Tipo.parto and Ospedale are independent.
This means that the mode of delivery does not depend on the hospital.

Alternative hypothesis (H₁):
The variables Tipo.parto and Ospedale are not independent.
This implies that the mode of delivery does depend on the hospital.

The figure below shows how delivery types are distributed across the three hospitals.

From a visual inspection of the frequency distribution, it is clear that natural delivery is by far the most frequent type of delivery in each hospital. However, based solely on the frequency distribution, it is not possible to determine whether there is a statistical dependency between Tipo.parto and Ospedale.

Below are presented the results of the chi-square test.

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_tab
## X-squared = 1.0972, df = 2, p-value = 0.5778

The obtained p-value is 0.5778, well above the 5% significance level. We therefore do not have sufficient evidence to reject the null hypothesis: the data give no indication that the proportion of cesarean deliveries differs across the three hospitals. Strictly speaking, failing to reject H₀ does not prove that Tipo.parto and Ospedale are independent; it only means that no association was detected in this sample.
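The mechanics of the test can be sketched in a few lines of Python. The contingency counts below are hypothetical, because the report does not show the actual table; only the reported statistic (X-squared = 1.0972, df = 2) is taken from the output above. For exactly 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(−x/2), which is used here to re-derive the reported p-value. The function names are illustrative.

```python
import math


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_two_df(statistic):
    """Upper-tail p-value of a chi-square with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# Hypothetical counts: rows = delivery type, columns = three hospitals.
hypothetical_table = [[30, 25, 35], [70, 75, 65]]
stat, dof = chi_square_statistic(hypothetical_table)
print(f"hypothetical table: X-squared = {stat:.4f}, df = {dof}")

# Re-deriving the p-value from the statistic reported in the R output.
reported_p = p_value_two_df(1.0972)
print(f"p-value for X-squared = 1.0972, df = 2: {reported_p:.4f}")
```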

5.2 One-Sample Student’s t-Test for Weight and Length

This subchapter evaluates whether the mean weight (Peso) and length (Lunghezza) of the newborns in the sample differ significantly from the population averages. The one-sample Student’s t-test is applied, which tests whether the mean of a sample is statistically compatible with a known population mean.

The t-test is suitable for variables that are approximately continuous and measured on an interval scale. It compares the observed sample mean with the hypothesized population mean while accounting for sample variability.

The reference population values are:

  • Population mean weight: 3300 g [2]

  • Population mean length: 495 mm [3]

The system of hypotheses for each variable is:

Null hypothesis (H₀):
The mean of the sample is equal to the population mean.

Alternative hypothesis (H₁):
The mean of the sample is different from the population mean.

Separate t-tests are conducted for Peso and Lunghezza to statistically evaluate these hypotheses. Below are the results.

## 
##  One Sample t-test
## 
## data:  df$Peso
## t = -1.516, df = 2499, p-value = 0.1296
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
##  3263.490 3304.672
## sample estimates:
## mean of x 
##  3284.081
## 
##  One Sample t-test
## 
## data:  df$Lunghezza
## t = -0.58514, df = 2499, p-value = 0.5585
## alternative hypothesis: true mean is not equal to 495
## 95 percent confidence interval:
##  493.6598 495.7242
## sample estimates:
## mean of x 
##   494.692

The one-sample t-tests indicate that the sample means for both Peso and Lunghezza do not differ significantly from the population averages.

For Peso, the sample mean is 3284.08 g, compared to the population mean of 3300 g. The test yields a t-value of -1.516 and a p-value of 0.1296. Since the p-value exceeds the common significance level of 0.05, the null hypothesis cannot be rejected. The 95% confidence interval (3263.49–3304.67 g) includes the population mean, further confirming no significant difference.

For Lunghezza, the sample mean is 494.69 mm, with a t-value of -0.585 and a p-value of 0.5585. Again, the null hypothesis cannot be rejected. The 95% confidence interval (493.66–495.72 mm) encompasses the population mean of 495 mm.

Overall, both tests support that the sample is consistent with the general population in terms of birth weight and length.
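The test statistics above can be checked by hand from the summary figures already reported: t = (sample mean − population mean) / (standard deviation / √n). The Python sketch below does so using the means from the R output, the standard deviations from Table 5, and n = 2,500. As an assumption made for simplicity, the two-sided p-value is approximated with the standard normal distribution instead of the t distribution with 2,499 degrees of freedom; with a sample this large the two are very close, but not identical. The function name `one_sample_t` is hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_t(sample_mean, sample_sd, n, mu0):
    """Return (t statistic, normal-approximation two-sided p-value)."""
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - mu0) / standard_error
    # Normal approximation; adequate only because n is large here.
    p_approx = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p_approx


t_peso, p_peso = one_sample_t(3284.081, 525.04, 2500, 3300)
t_length, p_length = one_sample_t(494.692, 26.32, 2500, 495)

print(f"Peso:      t = {t_peso:.3f}, approximate p = {p_peso:.4f}")
print(f"Lunghezza: t = {t_length:.3f}, approximate p = {p_length:.4f}")
```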

5.3 Comparison of Anthropometric Measures by Sex

This analysis investigates whether there are statistically significant differences in newborn anthropometric measures (Peso, Lunghezza, and Cranio) between male and female newborns.

The table below summarizes the means of the anthropometric measures by sex in the sample.

Table 8. Mean of Anthropometric Variables by Sex
Sesso Peso Lunghezza Cranio
F 3,161.13 489.76 337.63
M 3,408.22 499.67 342.45

Independent two-sample t-tests are applied to compare the means of each measure across the two groups. Specifically, Welch’s version of the test is used, as the R output below indicates.

The Welch test differs from the traditional Student’s t-test in that it does not assume equal variances between the two groups. This makes it more robust when the sample variances or sample sizes are unequal, a common situation in real datasets. R’s t.test function applies the Welch method by default (var.equal = FALSE), which gives better Type I error control under variance heterogeneity.
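To make the difference concrete, the Python sketch below computes the Welch statistic and its Welch–Satterthwaite degrees of freedom from two raw samples. The data are made-up numbers chosen so that the two groups have clearly unequal variances; they are not taken from the dataset, and `welch_t` is a hypothetical helper.

```python
import math
import statistics


def welch_t(x, y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    nx, ny = len(x), len(y)
    # Sample variances (n - 1 denominator), estimated separately per group.
    vx, vy = statistics.variance(x), statistics.variance(y)
    se_squared = vx / nx + vy / ny
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(se_squared)
    dof = se_squared ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1)
    )
    return t, dof


# Made-up samples with unequal variances (2.5 versus 10).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {dof:.4f}")
```

The degrees of freedom are generally not an integer, which is why the R outputs in this section report values such as 2490.7.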

For each variable, the hypothesis system is:

Null hypothesis (H₀):
The mean of the measure is statistically equal between males and females.

Alternative hypothesis (H₁):
The mean of the measure statistically differs between males and females.

Starting with the anthropometric variable of newborns’ weight, the scatter plot below displays the distribution of weights by sex. From a visual perspective, the plot suggests that males tend to have higher weights than females.

To assess whether this apparent difference is statistically significant, a two-sample t-test is performed to evaluate whether the mean weight is the same for both groups or whether a true difference exists.

## 
##  Welch Two Sample t-test
## 
## data:  df$Peso[df$Sesso == "M"] and df$Peso[df$Sesso == "F"]
## t = 12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  207.0615 287.1051
## sample estimates:
## mean of x mean of y 
##  3408.215  3161.132

The Welch two-sample t-test for the variable Peso (newborn’s weight) shows a statistically significant difference between male and female infants (p-value < 2.2e-16). The 95% confidence interval for the difference in mean weights ranges from 207.1 g to 287.1 g. Because this interval does not include 0, a difference of zero is not compatible with the sample data at the 5% level. Therefore, the test strongly supports the conclusion that male newborns are, on average, heavier than female newborns.

The analysis then turns to newborns’ length. The boxplot below shows the distribution by sex and again suggests that males tend to have higher values than females.

## 
##  Welch Two Sample t-test
## 
## data:  df$Lunghezza[df$Sesso == "M"] and df$Lunghezza[df$Sesso == "F"]
## t = 9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.876273 11.929470
## sample estimates:
## mean of x mean of y 
##  499.6672  489.7643

The Welch two-sample t-test for Lunghezza (newborn’s length) reveals a significant difference in mean length between male and female newborns. Males are, on average, longer (499.7 mm) compared to females (489.8 mm). The p-value (< 2.2e-16) confirms that the difference is highly statistically significant. The 95% confidence interval for the mean difference, ranging from 7.88 mm to 11.93 mm, further supports the presence of a consistent sex-related effect on newborn length. Because the interval excludes 0, the null hypothesis of equal means is rejected.

Finally, the analysis focuses on newborn’s head circumference, with the boxplot suggesting once again that males tend to have higher values.

## 
##  Welch Two Sample t-test
## 
## data:  df$Cranio[df$Sesso == "M"] and df$Cranio[df$Sesso == "F"]
## t = 7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.541270 6.089912
## sample estimates:
## mean of x mean of y 
##  342.4486  337.6330

For Cranio (newborn’s head circumference), the Welch two-sample t-test indicates that male newborns have larger head circumferences (342.4 mm) than females (337.6 mm). The p-value (1.718e-13) is well below the 5% threshold, allowing rejection of the null hypothesis of equal means. The 95% confidence interval for the difference in head circumference ranges from 3.54 mm to 6.09 mm and excludes 0, confirming a statistically significant difference between sexes, although its clinical relevance cannot be judged from the test alone.

Overall, all three tests conducted on the anthropometric variables indicate significant differences in mean values between male and female newborns. Sex is therefore a natural candidate predictor for the linear regression model.

6 Construction of the regression model

The aim of this chapter is to develop a statistical model capable of predicting the newborn’s weight (Peso) using the clinical and demographic variables available in the dataset.

To build an effective and interpretable predictive model, a stepwise regression strategy will be adopted. This approach allows variables to be added or removed based on their statistical contribution, leading to a parsimonious model that retains only the most informative predictors.

6.1 Correlation analysis

Before implementing the stepwise procedure, it is essential to examine the relationships among the quantitative variables. In particular, correlation analysis provides an initial screening of how strongly each predictor is associated with the newborn’s weight, while also revealing potential issues such as multicollinearity.

The pairwise correlation matrix below reports both the correlation coefficients and the corresponding scatter plots for all quantitative variables.

A first observation concerns the anthropometric measures. As expected, the strongest correlations with newborn weight (Peso) are found for Lunghezza (0.80) and Cranio (0.70). These strong positive associations are also visually apparent in the scatter plots, which show a clear upward linear trend. The high correlation with Gestazione (0.63) further indicates that weight tends to increase with gestational age.

In contrast, the correlations between Peso and Anni.madre as well as between Peso and N.gravidanze are close to zero. The corresponding scatter plots display widely dispersed points with nearly horizontal fitted lines, suggesting no meaningful linear relationship between these variables and newborn weight.

The analysis also highlights substantial positive correlations among the predictors—notably between Lunghezza and Cranio (0.60) and between Lunghezza and Gestazione (0.62). These associations indicate that multicollinearity may be present among the anthropometric and gestational variables, and therefore must be checked in the subsequent regression modeling phase.
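For reference, the Pearson coefficient behind each cell of such a matrix can be computed as in the Python sketch below. It is applied to the six preview rows of Table 1 purely as an illustration; with so few observations the result is not expected to match the full-sample correlations quoted above. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Six preview rows from Table 1; NOT the full dataset.
lunghezza_preview = [490, 490, 500, 515, 480, 495]
peso_preview = [3380, 3150, 3640, 3690, 3700, 3200]

r_preview = pearson_r(lunghezza_preview, peso_preview)
print(f"r(Lunghezza, Peso) on the preview rows: {r_preview:.2f}")
```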

For categorical variables, calculating correlation with Peso is not appropriate. Instead, their relationship with newborn weight will be explored visually using boxplots for each categorical variable.

The first boxplot displays newborn’s weight (Peso) by sex (Sesso). As confirmed by the t-test in Section 5.3, there is a statistically significant difference in mean weight between male and female newborns, with the test results leading to the rejection of the null hypothesis.

Below is the boxplot of Peso by mother’s smoking status (Fumatrici). Visually, no substantial difference in newborn weight is apparent between smokers and non-smokers. This observation is supported by an independent two-sample t-test, which yields a p-value of 0.3033—well above the 5% significance threshold—indicating that the null hypothesis of equal means cannot be rejected. Therefore, there is no statistically significant difference in Peso based on maternal smoking status, although this relationship will be further evaluated in the regression modeling phase.

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group Non-smoker and group Smoker is not equal to 0
## 95 percent confidence interval:
##  -45.61354 145.22674
## sample estimates:
## mean in group Non-smoker     mean in group Smoker 
##                 3286.153                 3236.346

Finally, the boxplot of Peso by Ospedale is examined. Visually, the boxplots do not indicate any substantial differences in newborn weight across the three hospitals.

6.2 Stepwise approach

6.2.1 Exploration of Primary Effects

To construct an effective predictive model for newborn weight (Peso), a stepwise regression approach is employed. This iterative procedure systematically adds or removes predictor variables based on their statistical contribution to the model, typically measured via criteria such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).

The stepwise method balances model complexity and predictive accuracy, helping to identify a parsimonious set of predictors while avoiding overfitting. Both main effects and potential interaction terms are considered, ensuring that relevant relationships between variables are captured in the final model.

Below, the initial regression model is presented, representing the complete model that includes all independent variables (predictors). This model will be referred to as mod1.

## 
## Call:
## lm(formula = Peso ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1123.26  -181.53   -14.45   161.05  2611.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6735.7960   141.4790 -47.610  < 2e-16 ***
## Anni.madre          0.8018     1.1467   0.699   0.4845    
## N.gravidanze       11.3812     4.6686   2.438   0.0148 *  
## FumatriciSmoker   -30.2741    27.5492  -1.099   0.2719    
## Gestazione         32.5773     3.8208   8.526  < 2e-16 ***
## Lunghezza          10.2922     0.3009  34.207  < 2e-16 ***
## Cranio             10.4722     0.4263  24.567  < 2e-16 ***
## Tipo.partoNat      29.6335    12.0905   2.451   0.0143 *  
## Ospedaleosp2      -11.0912    13.4471  -0.825   0.4096    
## Ospedaleosp3       28.2495    13.5054   2.092   0.0366 *  
## SessoM             77.5723    11.1865   6.934 5.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2487 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 668.7 on 10 and 2487 DF,  p-value: < 2.2e-16

The initial regression model (mod1) includes all independent variables to predict newborn weight (Peso). The model explains a substantial portion of the variability in weight, with an adjusted R-squared of 0.7278, indicating that approximately 73% of the variation in Peso is accounted for by the predictors.
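The adjusted R-squared and the overall F-statistic in the output follow directly from the multiple R-squared, the number of observations, and the number of predictors. The Python sketch below re-derives both for mod1 from the rounded value R² = 0.7289, so agreement is expected only up to rounding. The number of observations is inferred from the output (2,487 residual degrees of freedom plus 10 predictor coefficients plus the intercept); the function names are hypothetical.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def overall_f(r2, n, p):
    """F-statistic for the joint significance of p predictors."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


p_mod1 = 10                    # predictor coefficients, intercept excluded
n_mod1 = 2487 + p_mod1 + 1     # residual df + predictors + intercept

adj_mod1 = adjusted_r_squared(0.7289, n_mod1, p_mod1)
f_mod1 = overall_f(0.7289, n_mod1, p_mod1)

print(f"n = {n_mod1}")
print(f"adjusted R-squared = {adj_mod1:.4f}")
print(f"F-statistic = {f_mod1:.1f}")
```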

Among the predictors, Gestazione, Lunghezza, Cranio, and SessoM show very strong statistical significance (p < 0.001), indicating that longer gestation, larger anthropometric measures, and male sex are associated with higher birth weight. N.gravidanze, Tipo.partoNat, and Ospedaleosp3 are also significant at the 5% level, though their effect sizes are smaller.

Anni.madre, FumatriciSmoker, and Ospedaleosp2 are not statistically significant.

Now, a revised regression model (mod2) will be explored, removing the categorical variable Fumatrici due to its lack of statistical significance in mod1.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1122.63  -181.96   -14.91   161.39  2615.07 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6735.4444   141.4845 -47.606  < 2e-16 ***
## Anni.madre        0.8118     1.1467   0.708   0.4791    
## N.gravidanze     11.1201     4.6627   2.385   0.0172 *  
## Gestazione       32.3210     3.8138   8.475  < 2e-16 ***
## Lunghezza        10.3064     0.3006  34.285  < 2e-16 ***
## Cranio           10.4766     0.4263  24.577  < 2e-16 ***
## Tipo.partoNat    29.3770    12.0888   2.430   0.0152 *  
## Ospedaleosp2    -11.0363    13.4475  -0.821   0.4119    
## Ospedaleosp3     28.5194    13.5038   2.112   0.0348 *  
## SessoM           77.3928    11.1858   6.919 5.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2488 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7288, Adjusted R-squared:  0.7278 
## F-statistic: 742.8 on 9 and 2488 DF,  p-value: < 2.2e-16

The overall fit of mod2 remains essentially unchanged, with an adjusted R-squared of 0.7278, indicating that the explanatory power is maintained.

All previously significant predictors—Gestazione, Lunghezza, Cranio, SessoM, N.gravidanze, Tipo.partoNat, and Ospedaleosp3—retain their significance and similar effect sizes. Non-significant variables remain Anni.madre and Ospedaleosp2. It is therefore statistically appropriate to remove Fumatrici.

Next, mod3 is estimated, this time also excluding the variable Anni.madre, which is not statistically significant in either mod1 (p = 0.4845) or mod2 (p = 0.4791).

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.18  -181.16   -16.58   161.01  2620.19 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6707.4293   135.9438 -49.340  < 2e-16 ***
## N.gravidanze     12.3619     4.3325   2.853  0.00436 ** 
## Gestazione       31.9909     3.7896   8.442  < 2e-16 ***
## Lunghezza        10.3086     0.3004  34.316  < 2e-16 ***
## Cranio           10.4922     0.4254  24.661  < 2e-16 ***
## Tipo.partoNat    29.2803    12.0817   2.424  0.01544 *  
## Ospedaleosp2    -11.0227    13.4363  -0.820  0.41209    
## Ospedaleosp3     28.6408    13.4886   2.123  0.03382 *  
## SessoM           77.4412    11.1756   6.930 5.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.9 on 2491 degrees of freedom
## Multiple R-squared:  0.7287, Adjusted R-squared:  0.7278 
## F-statistic: 836.3 on 8 and 2491 DF,  p-value: < 2.2e-16

The performance of the model remains essentially unchanged, with an adjusted R-squared of 0.7278, confirming that excluding Anni.madre does not reduce the model’s explanatory capacity.

All remaining predictors behave consistently with previous results. The variables Gestazione, Lunghezza, Cranio, and Sesso (specifically the level SessoM) remain highly significant and continue to represent the strongest determinants of Peso. The variable N.gravidanze becomes slightly more influential (Estimate = 12.36) and is statistically significant. Among the categorical predictors, Tipo.parto (level Tipo.partoNat) and Ospedale (level Ospedaleosp3) also retain their significance, while Ospedaleosp2 remains non-significant.

Overall, the exclusion of Anni.madre seems to lead to a more parsimonious model without any loss of predictive performance.

To further support the model selection, the AIC and BIC values of the regression models estimated so far (mod1, mod2, and mod3) are compared. Both criteria reward good model fit while penalizing unnecessary complexity.
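As an illustration of how the two criteria combine fit and complexity, the following Python sketch computes AIC and BIC for a least-squares model from its residual sum of squares. It assumes Gaussian errors and counts the error variance as one parameter, which is how the `df` column of the AIC/BIC tables is understood here; the function name `gaussian_ic` is illustrative and not part of the original analysis.

```python
import math

def gaussian_ic(rss, n, k):
    """Return (AIC, BIC) for a least-squares fit with Gaussian errors.

    rss: residual sum of squares
    n:   number of observations used in the fit
    k:   number of estimated parameters, including the error variance
    """
    # -2 * maximised log-likelihood of a Gaussian linear model
    neg2_loglik = n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = neg2_loglik + 2 * k            # fixed penalty of 2 per parameter
    bic = neg2_loglik + math.log(n) * k  # penalty grows with log(n)
    return aic, bic

# mod2: RSS read from the ANOVA table reported later in this section,
# n = 2488 residual df + 10 coefficients = 2498, k = 10 coefficients + 1 variance
aic_mod2, bic_mod2 = gaussian_ic(rss=186833870, n=2498, k=11)
print(round(aic_mod2, 2), round(bic_mod2, 2))
```

With these inputs the sketch agrees with the values reported for mod2 up to rounding. At n = 2498 the BIC penalty is about 7.8 per parameter, against 2 for AIC.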

##      df      BIC
## mod1 12 35215.45
## mod2 11 35208.84
## mod3 10 35228.03
##      df      AIC
## mod1 12 35145.57
## mod2 11 35144.78
## mod3 10 35169.79

Taken at face value, mod2 achieves the lowest AIC (35144.78) and the lowest BIC (35208.84) among the three models.

Although mod3 is more parsimonious (one fewer predictor), it shows a higher AIC (35169.79) and BIC (35228.03). This gap should be read with caution. The summary of mod2 reports 2,488 residual degrees of freedom with 10 coefficients, so it was fitted on 2,498 observations after two rows with missing values were dropped, whereas mod3 has 2,491 residual degrees of freedom with 9 coefficients and was therefore fitted on all 2,500 rows. Information criteria are only comparable across models fitted to the same observations, so the higher values for mod3 largely reflect the two additional rows rather than a loss of fit; its adjusted R-squared is in fact unchanged at 0.7278.

Overall, Anni.madre is retained in the model, primarily because maternal age is a well-established clinical control variable in neonatal health research. Including it helps ensure that the model appropriately accounts for known demographic influences.
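Because information criteria depend on the number of observations, it can be useful to check that the compared models were fitted to the same rows. The Python sketch below recovers each sample size from the printed summaries as residual degrees of freedom plus the number of estimated coefficients. The dictionary and function names are illustrative; only the numbers come from the outputs in this section.

```python
# (residual degrees of freedom, number of coefficients incl. intercept),
# read from the model summaries printed in this section
fits = {
    "mod2": (2488, 10),
    "mod3": (2491, 9),
    "mod4": (2490, 8),
    "mod5": (2491, 7),
}

def n_obs(resid_df, n_coef):
    """Observations used in a linear fit = residual df + estimated coefficients."""
    return resid_df + n_coef

sizes = {name: n_obs(*spec) for name, spec in fits.items()}

# Flag models whose sample size differs from the reference model (mod2)
reference = sizes["mod2"]
differing = [name for name, n in sizes.items() if n != reference]

print(sizes)
print("fitted on a different sample than mod2:", differing)
```

Under these inputs only mod3 uses a different number of rows, which is worth keeping in mind when its AIC and BIC are set against those of the other models.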

The next model, mod4, evaluates the effect of removing the categorical variable Ospedale. In the previous mod2, only the level Ospedaleosp3 showed statistical significance at the 5% level (p = 0.0348), while Ospedaleosp2 was clearly non-significant (p = 0.4119).

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1139.50  -181.60   -14.59   160.14  2633.16 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6737.9269   141.5747 -47.593  < 2e-16 ***
## Anni.madre        0.8793     1.1479   0.766   0.4438    
## N.gravidanze     11.4176     4.6676   2.446   0.0145 *  
## Gestazione       32.6300     3.8180   8.546  < 2e-16 ***
## Lunghezza        10.2839     0.3009  34.176  < 2e-16 ***
## Cranio           10.4896     0.4268  24.574  < 2e-16 ***
## Tipo.partoNat    30.1222    12.1038   2.489   0.0129 *  
## SessoM           77.8374    11.2008   6.949 4.67e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7278, Adjusted R-squared:  0.727 
## F-statistic: 950.9 on 7 and 2490 DF,  p-value: < 2.2e-16

After removing the categorical variable Ospedale, the new model (mod4) maintains essentially the same explanatory power, with an adjusted R-squared of 0.7270, nearly identical to that of mod2 (0.7278). All key predictors—Gestazione, Lunghezza, Cranio, SessoM, N.gravidanze, and Tipo.partoNat—remain significant and stable in magnitude, confirming that the hospital variable was not contributing materially to the prediction of Peso.

The Bayesian Information Criterion (BIC) will be used from this point onward as the primary metric for model comparison. BIC imposes a stronger penalty for model complexity than AIC and therefore tends to favor simpler, more interpretable and parsimonious models.

##      df      BIC
## mod1 12 35215.45
## mod2 11 35208.84
## mod3 10 35228.03
## mod4  9 35202.49

Using BIC as the criterion, mod4 clearly emerges as the best-performing model among those examined so far. Its BIC value (35202.49) is the lowest, indicating the most favorable trade-off between goodness of fit and model simplicity. This improvement is achieved despite removing an entire categorical predictor with multiple levels, reinforcing the idea that Ospedale was not a meaningful determinant of neonatal weight.

To further justify the removal of the variable Ospedale, an ANOVA was performed to compare the nested models mod2 (including Ospedale) and mod4 (excluding Ospedale). The test evaluates whether removing the hospital variable significantly worsens the model fit.

## Analysis of Variance Table
## 
## Model 1: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Sesso
##   Res.Df       RSS Df Sum of Sq      F   Pr(>F)   
## 1   2488 186833870                                
## 2   2490 187530244 -2   -696373 4.6367 0.009774 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show an F-statistic of 4.637 with a p-value of 0.0098. This indicates that, strictly statistically, the simpler model (mod4) without Ospedale fits slightly worse than the more complex model (mod2). However, the practical impact on the model is minimal: the adjusted R-squared remains virtually unchanged, and the BIC is lower for mod4, reflecting a more parsimonious and efficient model.
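The F-statistic above can be reconstructed from the two residual sums of squares in the ANOVA table. The Python sketch below does so for the special case of dropping exactly two coefficients, where the p-value of the F distribution has a simple closed form, so no statistical library is needed. The function name is illustrative and the closed form applies only when the numerator has 2 degrees of freedom.

```python
def nested_f_test_2df(rss_full, df_full, rss_reduced):
    """F-test for removing exactly two coefficients from a linear model.

    rss_full:    residual sum of squares of the larger model
    df_full:     residual degrees of freedom of the larger model
    rss_reduced: residual sum of squares of the smaller (nested) model
    """
    # Extra unexplained variation per dropped coefficient,
    # relative to the residual variance of the larger model
    f_stat = ((rss_reduced - rss_full) / 2) / (rss_full / df_full)
    # Upper tail of an F(2, df_full) distribution (closed form for 2 numerator df)
    p_value = (1 + 2 * f_stat / df_full) ** (-df_full / 2)
    return f_stat, p_value

# mod2 (with Ospedale) versus mod4 (without Ospedale), values from the table above
f_stat, p_value = nested_f_test_2df(rss_full=186833870, df_full=2488,
                                    rss_reduced=187530244)
print(round(f_stat, 4), round(p_value, 6))
```

The result matches the reported F of about 4.637 and p-value of about 0.0098, confirming that the test concerns the two Ospedale dummy coefficients jointly.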

Considering that Ospedale likely captures structural or demographic differences rather than a direct biological effect on neonatal weight, and that BIC favors simpler models, it is reasonable to prefer mod4.

Having confirmed that removing the variable Ospedale improves model parsimony without reducing explanatory power, a new model is then explored—mod5—in which the categorical variable Tipo.parto is also excluded.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1159.84  -181.98   -15.04   164.16  2634.01 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6712.0654   141.3399 -47.489  < 2e-16 ***
## Anni.madre       0.8909     1.1491   0.775   0.4382    
## N.gravidanze    11.1203     4.6710   2.381   0.0174 *  
## Gestazione      32.6914     3.8219   8.554  < 2e-16 ***
## Lunghezza       10.2461     0.3008  34.058  < 2e-16 ***
## Cranio          10.5239     0.4271  24.642  < 2e-16 ***
## SessoM          77.8998    11.2125   6.948 4.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2491 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7264 
## F-statistic:  1106 on 6 and 2491 DF,  p-value: < 2.2e-16

The performance of mod5 remains very similar to the previous specifications: the adjusted R-squared decreases only minimally (from 0.7270 in mod4 to 0.7264 in mod5), indicating that the explanatory power of the model is essentially preserved.

All key biological predictors—Gestazione, Lunghezza, Cranio, and SessoM—remain highly significant, as well as N.gravidanze, confirming their central role in determining neonatal weight, while Anni.madre remains non-significant but is retained due to its clinical relevance as a control variable.

##      df      BIC
## mod1 12 35215.45
## mod2 11 35208.84
## mod3 10 35228.03
## mod4  9 35202.49
## mod5  8 35200.87

Moreover, the comparison of BIC values confirms that mod5 achieves the best overall performance among all models without interactions. With the lowest BIC (35200.87), mod5 is preferred because the criterion explicitly rewards parsimony, and this model attains a simpler specification—using fewer predictors—while maintaining virtually the same adjusted R² as the more complex alternatives. This strengthens the justification for selecting mod5 as the most appropriate baseline model before introducing interaction terms and non-linear terms.

6.2.2 Exploration of Interaction Effects

Up to this point, several regression models without interaction terms have been estimated and compared, progressively removing non-significant predictors while monitoring model performance using the BIC criterion. Among all specifications tested so far, the preferred model without interactions is mod5, which includes the following predictors: Anni.madre, N.gravidanze, Gestazione, Lunghezza, Cranio, and Sesso.
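To make the baseline model concrete, the following Python sketch applies the mod5 coefficients reported above to the first row of Table 1 (a male newborn, maternal age 26, no previous pregnancies, 42 weeks of gestation, length 490, cranial measure 325). The function and dictionary names are illustrative; the coefficients are copied from the mod5 summary.

```python
# Coefficients of mod5 as printed in its summary
MOD5_COEF = {
    "intercept": -6712.0654,
    "Anni.madre": 0.8909,
    "N.gravidanze": 11.1203,
    "Gestazione": 32.6914,
    "Lunghezza": 10.2461,
    "Cranio": 10.5239,
    "SessoM": 77.8998,
}

def predict_mod5(anni_madre, n_gravidanze, gestazione, lunghezza, cranio, male):
    """Predicted birth weight (same unit as Peso) under the mod5 main-effects model."""
    c = MOD5_COEF
    return (c["intercept"]
            + c["Anni.madre"] * anni_madre
            + c["N.gravidanze"] * n_gravidanze
            + c["Gestazione"] * gestazione
            + c["Lunghezza"] * lunghezza
            + c["Cranio"] * cranio
            + c["SessoM"] * (1 if male else 0))

# First row of Table 1: observed Peso = 3380
predicted = predict_mod5(26, 0, 42, 490, 325, male=True)
residual = 3380 - predicted
print(round(predicted, 1), round(residual, 1))
```

For this newborn the prediction falls below the observed weight by less than one residual standard error (274.7), which is in line with the typical size of the model's errors.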

Having established a baseline model with main effects only, the next step is to investigate whether interaction effects may further improve the model. Interactions help capture situations in which predictors do not act independently but jointly influence the outcome.

In the following section, potential interactions between clinically or statistically relevant variables will be explored to assess whether they enhance the explanatory accuracy of the model.

The first interaction investigated is between Gestazione and Sesso, producing model mod_int1. The aim is to evaluate whether the effect of gestational age on neonatal weight differs between males and females.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Sesso + Gestazione:Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1155.27  -181.13   -14.24   162.85  2632.48 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6603.2349   171.4450 -38.515  < 2e-16 ***
## Anni.madre            0.9014     1.1491   0.784    0.433    
## N.gravidanze         11.0587     4.6710   2.368    0.018 *  
## Gestazione           29.8161     4.6021   6.479 1.11e-10 ***
## Lunghezza            10.2501     0.3008  34.071  < 2e-16 ***
## Cranio               10.5250     0.4270  24.646  < 2e-16 ***
## SessoM             -185.2394   234.9202  -0.789    0.430    
## Gestazione:SessoM     6.7431     6.0131   1.121    0.262    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7265 
## F-statistic: 948.3 on 7 and 2490 DF,  p-value: < 2.2e-16
##          df      BIC
## mod5      8 35200.87
## mod_int1  9 35207.43

The results show that the interaction term Gestazione × SessoM is not statistically significant (p = 0.262). This pattern indicates that allowing different gestational-age slopes for males and females does not meaningfully improve the model.

Importantly, the adjusted R-squared (0.7265) is virtually unchanged compared to the preferred model without interactions (mod5, adjusted R² = 0.7264), and no improvement is observed in the residual standard error. These results suggest that adding this interaction does not enhance the explanatory power of the model.

Given the lack of statistical and practical contribution, the interaction term Gestazione × Sesso does not appear justified and will not be retained.

The second interaction explored is between Gestazione and maternal smoking status (Fumatrici). This model, named mod_int2, assesses whether the effect of gestational age on neonatal weight differs between smokers and non-smokers.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Sesso + Fumatrici + Gestazione:Fumatrici, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1159.92  -182.47   -16.54   164.21  2630.79 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -6728.7262   142.1541 -47.334  < 2e-16 ***
## Anni.madre                     0.8587     1.1492   0.747   0.4550    
## N.gravidanze                  11.4565     4.6771   2.449   0.0144 *  
## Gestazione                    33.4900     3.8616   8.673  < 2e-16 ***
## Lunghezza                     10.2260     0.3012  33.956  < 2e-16 ***
## Cranio                        10.5150     0.4271  24.621  < 2e-16 ***
## SessoM                        78.6576    11.2256   7.007 3.12e-12 ***
## FumatriciSmoker              785.9930   757.7101   1.037   0.2997    
## Gestazione:FumatriciSmoker   -20.7952    19.2877  -1.078   0.2811    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2489 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7273, Adjusted R-squared:  0.7265 
## F-statistic:   830 on 8 and 2489 DF,  p-value: < 2.2e-16
##          df      BIC
## mod5      8 35200.87
## mod_int2 10 35214.13

The results show that neither the main effect of FumatriciSmoker (p = 0.2997) nor the interaction term Gestazione × FumatriciSmoker (p = 0.2811) is statistically significant.

Consistent with this, the model’s adjusted R-squared (0.7265) and residual standard error remain virtually unchanged relative to the preferred baseline model (mod5). This confirms that the interaction provides no improvement in explanatory power.

The non-significant interaction suggests that the growth trajectory across gestational weeks is similar for newborns of smokers and non-smokers, with no detectable differential pattern in weight gain.

Additionally, the BIC of mod_int2 (35214.13) is higher than that of mod5 (35200.87).

Given the absence of statistical support and the lack of contribution to model performance, the Gestazione × Fumatrici interaction will not be retained.

The third interaction, explored in mod_int3, investigates whether the effect of gestational age (Gestazione) on neonatal weight varies depending on maternal age (Anni.madre). This evaluates a clinically plausible hypothesis: older and younger mothers may experience different fetal growth patterns across gestational weeks.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Sesso + Anni.madre:Gestazione, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1145.57  -182.19   -14.16   162.57  2633.13 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -5046.8110   626.4363  -8.056 1.21e-15 ***
## Anni.madre              -56.2111    20.9597  -2.682  0.00737 ** 
## N.gravidanze             11.1063     4.6649   2.381  0.01735 *  
## Gestazione               -9.7323    16.0102  -0.608  0.54332    
## Lunghezza                10.2116     0.3007  33.957  < 2e-16 ***
## Cranio                   10.5231     0.4265  24.672  < 2e-16 ***
## SessoM                   78.2130    11.1986   6.984 3.66e-12 ***
## Anni.madre:Gestazione     1.4718     0.5394   2.728  0.00641 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7279, Adjusted R-squared:  0.7271 
## F-statistic: 951.6 on 7 and 2490 DF,  p-value: < 2.2e-16

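Whether the effect of gestational age on neonatal weight varies with maternal age was tested by including the interaction term Anni.madre × Gestazione. Although the coefficient is statistically significant (p = 0.00641), the effect size is small and the adjusted R² improves only marginally (0.7264 → 0.7271). Moreover, including this term changes the main effects markedly: the Gestazione coefficient becomes non-significant and negative. With the interaction in the model, this coefficient represents the gestational-age slope at Anni.madre = 0, a value outside the plausible range of maternal ages, so the sign change reflects the uncentered parameterization rather than a biologically implausible effect. It nevertheless makes the individual coefficients harder to interpret.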

Model comparison using BIC confirmed that the interaction did not enhance overall fit (BIC increased from 35200.87 to 35201.23).

##          df      BIC
## mod5      8 35200.87
## mod_int3  9 35201.23

Importantly, while maternal age and gestational age are individually recognized as predictors of birth weight, there is no robust clinical evidence that their interaction exerts a meaningful effect.

Overall, although the interaction term reaches statistical significance, it does not materially enhance model performance and introduces additional complexity without improving predictive or interpretative value. For these reasons, the interaction between Anni.madre and Gestazione is not retained for the final model selection.
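The Gestazione coefficient in mod_int3 is easier to read through the conditional slope it implies. With an Anni.madre × Gestazione term, the effect of one additional gestational week is the Gestazione coefficient plus the interaction coefficient times maternal age. The Python sketch below evaluates this for a few illustrative ages (20, 30, 40), which are chosen for demonstration and are not summary statistics of the dataset.

```python
# Coefficients from the mod_int3 summary
B_GESTAZIONE = -9.7323    # slope of Gestazione when Anni.madre = 0
B_INTERACTION = 1.4718    # change in that slope per year of maternal age

def gestazione_slope(anni_madre):
    """Effect on Peso of one extra gestational week at a given maternal age."""
    return B_GESTAZIONE + B_INTERACTION * anni_madre

for age in (20, 30, 40):  # illustrative ages only
    print(age, round(gestazione_slope(age), 2))

# Maternal age at which the implied slope would be exactly zero
zero_slope_age = -B_GESTAZIONE / B_INTERACTION
print(round(zero_slope_age, 1))
```

The implied slope is positive for any realistic maternal age and becomes zero only at an age below 7 years, so the negative main-effect coefficient describes a point far outside the data.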

The interaction between N.gravidanze and Gestazione is now explored in the model mod_int4.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Sesso + N.gravidanze:Gestazione, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1159.02  -181.98   -14.32   163.76  2632.87 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -6619.5765   163.7454 -40.426  < 2e-16 ***
## Anni.madre                  0.8826     1.1491   0.768    0.443    
## N.gravidanze              -67.1692    70.1506  -0.958    0.338    
## Gestazione                 30.2786     4.3885   6.900 6.59e-12 ***
## Lunghezza                  10.2458     0.3008  34.059  < 2e-16 ***
## Cranio                     10.5302     0.4271  24.656  < 2e-16 ***
## SessoM                     77.8221    11.2121   6.941 4.95e-12 ***
## N.gravidanze:Gestazione     2.0182     1.8044   1.119    0.263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7265 
## F-statistic: 948.3 on 7 and 2490 DF,  p-value: < 2.2e-16
##          df      BIC
## mod5      8 35200.87
## mod_int4  9 35207.44

The coefficient for the interaction term is not statistically significant (p = 0.263), and the adjusted R-squared remains essentially unchanged compared to the baseline model mod5. Additionally, the BIC increases from 35200.87 (mod5) to 35207.44 (mod_int4), indicating that including this interaction does not improve model parsimony or explanatory power. Consequently, there is no evidence that the effect of gestational age on newborn’s weight depends on the number of previous pregnancies, and the interaction term is not retained in the final model.

The interaction between Cranio and Lunghezza is explored in the model mod_int5.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Sesso + Cranio:Lunghezza, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1161.16  -180.93   -12.18   165.61  2860.38 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.833e+03  1.019e+03  -1.799   0.0721 .  
## Anni.madre        8.940e-01  1.144e+00   0.782   0.4346    
## N.gravidanze      1.159e+01  4.651e+00   2.492   0.0128 *  
## Gestazione        3.846e+01  3.988e+00   9.645  < 2e-16 ***
## Lunghezza        -3.063e-01  2.203e+00  -0.139   0.8894    
## Cranio           -4.773e+00  3.192e+00  -1.495   0.1350    
## SessoM            7.316e+01  1.121e+01   6.529 8.01e-11 ***
## Lunghezza:Cranio  3.158e-02  6.531e-03   4.835 1.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7296, Adjusted R-squared:  0.7289 
## F-statistic: 959.9 on 7 and 2490 DF,  p-value: < 2.2e-16

The inclusion of the Cranio × Lunghezza interaction term improves the regression model, both statistically and substantively. While head circumference and body length are individually correlated with neonatal weight, their effects overlap because both capture aspects of overall fetal size. The interaction term absorbs this shared variance and reflects body proportionality more effectively than treating the two measures separately.

Statistically, the interaction is highly significant (p < 0.001), the residual standard error decreases from 274.7 to 273.5, and the adjusted R² increases modestly (from 0.7264 to 0.7289), indicating improved fit.

BIC-based model comparison and ANOVA testing are also used to further investigate the significance of the Cranio × Lunghezza interaction.

##          df      BIC
## mod5      8 35200.87
## mod_int5  9 35185.35
## Analysis of Variance Table
## 
## Model 1: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + Cranio:Lunghezza
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2491 187996688                                  
## 2   2490 186248307  1   1748380 23.375 1.415e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model comparison criteria confirm this improvement: BIC decreases from 35200.87 (mod5) to 35185.35 (mod_int5), and ANOVA shows that the reduction in unexplained variance is unlikely due to chance (p ≈ 1.4 × 10⁻⁶). Therefore, the interaction between newborn length and head circumference provides meaningful additional explanatory power in predicting birth weight.
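When a single term is added, the ANOVA F-statistic and the t-statistic of that term carry the same information: F equals t squared. The Python sketch below checks this with the figures reported for mod5 and mod_int5. Variable names are illustrative, and the small discrepancy arises because the printed coefficient and standard error are rounded.

```python
# Residual sums of squares and residual df from the ANOVA table above
rss_mod5 = 187996688
rss_int5 = 186248307
df_int5 = 2490

# F-statistic for one added coefficient
f_stat = (rss_mod5 - rss_int5) / (rss_int5 / df_int5)

# t-statistic of Lunghezza:Cranio, rebuilt from the printed estimate and std. error
t_stat = 3.158e-02 / 6.531e-03

print(round(f_stat, 3), round(t_stat ** 2, 3))
```

This is why the p-value of the ANOVA (about 1.4 × 10⁻⁶) coincides with the p-value of the interaction coefficient in the model summary.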

Clinically, these results are plausible: a proportionally large head and long body jointly represent overall fetal growth and maturity more powerfully than either measure alone. For these reasons, the Cranio × Lunghezza interaction is a candidate for retention in the final model as a meaningful predictor of neonatal weight.

6.2.3 Exploration of non-linear effects

We investigate potential non-linear effects for three variables: Gestazione, Lunghezza, and Cranio. Visual inspection of correlation patterns in section 6.1 suggests that these predictors may not follow a strictly linear trend. Starting from the baseline linear model (mod5), we extend the model by adding a quadratic term for each variable individually. This approach allows us to assess whether introducing curvature significantly improves model fit, and to isolate which predictor, if any, benefits from a non-linear specification.

Non-linear effects for the variable Gestazione are explored in mod_gestazione_nl.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Gestazione2 + Lunghezza + Cranio + Sesso, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1155.4  -180.9   -12.6   165.0  2656.6 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4655.9158   898.7477  -5.180 2.39e-07 ***
## Anni.madre       0.9756     1.1487   0.849   0.3958    
## N.gravidanze    11.0879     4.6669   2.376   0.0176 *  
## Gestazione     -82.2342    49.7570  -1.653   0.0985 .  
## Gestazione2      1.5347     0.6625   2.317   0.0206 *  
## Lunghezza       10.3522     0.3040  34.048  < 2e-16 ***
## Cranio          10.6202     0.4287  24.772  < 2e-16 ***
## SessoM          75.6414    11.2450   6.727 2.15e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.5 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7277, Adjusted R-squared:  0.7269 
## F-statistic: 950.5 on 7 and 2490 DF,  p-value: < 2.2e-16
##                   df      BIC
## mod5               8 35200.87
## mod_int5           9 35185.35
## mod_gestazione_nl  9 35203.31

Introducing a quadratic term for Gestazione results in a statistically significant coefficient for Gestazione2 (p = 0.0206), indicating a mild non-linear pattern. However, this effect is relatively weak compared to the other predictors in the model, whose p-values are substantially smaller. More importantly, model fit does not improve: the BIC increases from 35200.87 (mod5) to 35203.31 (mod_gestazione_nl), meaning that the added complexity is not justified. Overall, even though a small curvature can be detected, it does not meaningfully enhance explanatory or predictive performance, so Gestazione can be kept as a linear term in the final model.

Non-linear effects for the variable Lunghezza are explored in mod_lunghezza_nl.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Lunghezza2 + Cranio + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1178.63  -181.69   -12.53   163.16  1782.32 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  180.684839 725.416864   0.249  0.80332    
## Anni.madre     0.768213   1.128384   0.681  0.49606    
## N.gravidanze  12.932944   4.590205   2.818  0.00488 ** 
## Gestazione    42.809484   3.895536  10.989  < 2e-16 ***
## Lunghezza    -20.242336   3.163266  -6.399 1.86e-10 ***
## Lunghezza2     0.031630   0.003267   9.681  < 2e-16 ***
## Cranio        10.637035   0.419499  25.357  < 2e-16 ***
## SessoM        69.905286  11.040380   6.332 2.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.7 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7362 
## F-statistic: 996.7 on 7 and 2490 DF,  p-value: < 2.2e-16
##                  df      BIC
## mod5              8 35200.87
## mod_int5          9 35185.35
## mod_lunghezza_nl  9 35116.40

Adding a quadratic term for Lunghezza produces a highly significant coefficient for Lunghezza2 (p < 2e-16), indicating a strong non-linear component in the relationship between newborn length and weight. Both the linear (Lunghezza) and quadratic (Lunghezza2) terms are significant, confirming the presence of curvature and justifying their joint inclusion in the model.

Importantly, model fit improves substantially: the BIC decreases from 35200.87 (mod5) to 35116.40 (mod_lunghezza_nl), the largest improvement among the models tested so far. This shows that the added non-linear term captures meaningful structure in the data and enhances the model’s explanatory and predictive performance. Therefore, a non-linear specification for Lunghezza might be justified in the final model.
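With a quadratic term, the effect of Lunghezza is no longer a single number: the slope at a given length is the linear coefficient plus twice the quadratic coefficient times that length. The Python sketch below evaluates this slope and the turning point of the fitted parabola using the mod_lunghezza_nl coefficients. The function names and the evaluation length of 490 (the value in the first rows of Table 1) are illustrative choices.

```python
# Coefficients from the mod_lunghezza_nl summary
B_LUNGHEZZA = -20.242336
B_LUNGHEZZA2 = 0.031630

def quadratic_slope(b1, b2, x):
    """Derivative of b1*x + b2*x**2 with respect to x."""
    return b1 + 2 * b2 * x

def turning_point(b1, b2):
    """Value of x at which the fitted parabola changes direction."""
    return -b1 / (2 * b2)

slope_at_490 = quadratic_slope(B_LUNGHEZZA, B_LUNGHEZZA2, 490)
vertex = turning_point(B_LUNGHEZZA, B_LUNGHEZZA2)
print(round(slope_at_490, 2), round(vertex, 1))
```

At a length of 490 the implied slope is close to the linear coefficient of mod5 (10.25), and it increases for longer newborns. The turning point lies near 320, so the negative linear coefficient does not imply a decreasing relationship for lengths above that value.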

Non-linear effects for the variable Cranio are explored in mod_cranio_nl.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Cranio2 + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1148.51  -181.10   -14.44   164.85  2617.78 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    49.42349 1152.83339   0.043   0.9658    
## Anni.madre      0.84341    1.14140   0.739   0.4600    
## N.gravidanze   11.49937    4.63992   2.478   0.0133 *  
## Gestazione     39.19182    3.95232   9.916  < 2e-16 ***
## Lunghezza      10.48787    0.30160  34.774  < 2e-16 ***
## Cranio        -31.77255    7.17044  -4.431 9.78e-06 ***
## Cranio2         0.06257    0.01059   5.909 3.91e-09 ***
## SessoM         73.03006   11.16735   6.540 7.46e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 272.9 on 2490 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7309, Adjusted R-squared:  0.7301 
## F-statistic: 965.9 on 7 and 2490 DF,  p-value: < 2.2e-16
##               df      BIC
## mod5           8 35200.87
## mod_int5       9 35185.35
## mod_cranio_nl  9 35173.91

Including a quadratic term for Cranio reveals a clear non-linear pattern: both the linear (Cranio) and quadratic (Cranio2) terms are highly significant (p = 9.78e-06 and p = 3.91e-09, respectively). This indicates that the relationship between head circumference and newborn weight is curved rather than strictly linear.

Compared to the baseline model (mod5), the BIC decreases from 35200.87 to 35173.91 in the non-linear model (mod_cranio_nl), confirming that the quadratic term improves model fit. Although the improvement is smaller than the one observed for mod_lunghezza_nl, it is large enough to make a non-linear specification for Cranio worth considering for the final model.

Next, we explore a model that includes non-linear effects for both Lunghezza and Cranio, called mod_nl_combined.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Lunghezza2 + Cranio + Cranio2 + Sesso, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1179.52  -181.55   -11.82   162.97  1767.22 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.228e+01  1.140e+03  -0.011  0.99141    
## Anni.madre    7.682e-01  1.129e+00   0.681  0.49615    
## N.gravidanze  1.295e+01  4.592e+00   2.820  0.00484 ** 
## Gestazione    4.269e+01  3.935e+00  10.849  < 2e-16 ***
## Lunghezza    -2.082e+01  4.122e+00  -5.052 4.69e-07 ***
## Lunghezza2    3.222e-02  4.230e-03   7.616 3.68e-14 ***
## Cranio        1.265e+01  9.181e+00   1.378  0.16836    
## Cranio2      -2.975e-03  1.355e-02  -0.219  0.82629    
## SessoM        6.999e+01  1.105e+01   6.334 2.82e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.8 on 2489 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7361 
## F-statistic: 871.8 on 8 and 2489 DF,  p-value: < 2.2e-16
##                  df      BIC
## mod5              8 35200.87
## mod_int5          9 35185.35
## mod_lunghezza_nl  9 35116.40
## mod_cranio_nl     9 35173.91
## mod_nl_combined  10 35124.18

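Among the quadratic terms, only Lunghezza2 is highly significant (p = 3.68e-14), indicating a strong non-linear effect of newborn length on weight. The linear term for Lunghezza also remains significant. In contrast, neither Cranio (p = 0.168) nor Cranio2 (p = 0.826) is individually significant. This should not be read as head circumference being irrelevant: Cranio is highly significant in mod_lunghezza_nl, and the loss of individual significance here is likely related to the strong collinearity between Cranio and its square. What the result does suggest is that a quadratic term for Cranio adds little once the curvature in Lunghezza is modeled. Other predictors retain their expected effects: gestational age and sex remain strong positive predictors, and the number of previous pregnancies has a modest positive impact.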
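One possible contributor to the non-significance of both Cranio and Cranio2 in this model is collinearity between a predictor and its own square. The Python sketch below illustrates the point on a hypothetical grid of head measurements from 300 to 380, not on the study data, and shows that centering the variable before squaring removes the correlation on a symmetric grid. The helper `pearson` is written out by hand to keep the example dependency-free.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, evenly spaced head measurements (assumption for illustration)
cranio = list(range(300, 381))

# Raw predictor versus its square: almost perfectly correlated
raw_corr = pearson(cranio, [c ** 2 for c in cranio])

# Centered predictor versus its square: uncorrelated on a symmetric grid
mean_cranio = sum(cranio) / len(cranio)
centered = [c - mean_cranio for c in cranio]
centered_corr = pearson(centered, [c ** 2 for c in centered])

print(round(raw_corr, 4), round(abs(centered_corr), 4))
```

When a variable and its square are this strongly correlated, their individual standard errors inflate, so joint tests or information criteria are a more reliable guide than the separate t-tests.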

Model comparison using BIC confirms that including only the quadratic effect of Lunghezza (mod_lunghezza_nl) yields the best fit (BIC = 35116.40), outperforming models with interactions (mod_int5) or additional quadratic terms for Cranio. This indicates that the non-linear relationship between body length and weight is the dominant curvature in the data, while interactions and additional quadratic terms do not meaningfully improve model performance and should not be included.
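A compact way to read the BIC table is to express each model as a difference from the best one. The Python sketch below does this for the models compared above; the dictionary simply restates the reported BIC values and the variable names are illustrative.

```python
# BIC values as reported in the comparison table above
bic = {
    "mod5": 35200.87,
    "mod_int5": 35185.35,
    "mod_lunghezza_nl": 35116.40,
    "mod_cranio_nl": 35173.91,
    "mod_nl_combined": 35124.18,
}

best_model = min(bic, key=bic.get)

# Difference from the best model, ordered from best to worst
delta_bic = {
    name: round(value - bic[best_model], 2)
    for name, value in sorted(bic.items(), key=lambda item: item[1])
}

for name, delta in delta_bic.items():
    print(f"{name:18s} {delta:6.2f}")
```

The ordering makes the gaps explicit: mod_nl_combined trails the best model by about 8 points, while the models without a quadratic length term trail it by more than 50.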

Next, we extend the analysis by exploring a model that includes both the non-linear effect of Lunghezza and the interaction between length and head circumference. This extended model is called mod_int5_lunghezza_nl.

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Sesso + Lunghezza + Lunghezza2 + Cranio + Cranio:Lunghezza + 
##     Cranio:Lunghezza2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1184.97  -180.57   -12.25   166.10  1308.88 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.169e+03  6.335e+03  -0.342  0.73209    
## Anni.madre         7.154e-01  1.128e+00   0.634  0.52600    
## N.gravidanze       1.316e+01  4.583e+00   2.872  0.00411 ** 
## Gestazione         4.070e+01  3.935e+00  10.344  < 2e-16 ***
## SessoM             7.177e+01  1.103e+01   6.507 9.27e-11 ***
## Lunghezza         -2.188e+01  2.911e+01  -0.752  0.45226    
## Lunghezza2         4.531e-02  3.320e-02   1.364  0.17254    
## Cranio             2.699e+01  1.941e+01   1.390  0.16463    
## Lunghezza:Cranio  -3.275e-02  8.734e-02  -0.375  0.70772    
## Lunghezza2:Cranio -1.848e-06  9.819e-05  -0.019  0.98499    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.2 on 2488 degrees of freedom
##   (2 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.7383, Adjusted R-squared:  0.7374 
## F-statistic: 779.9 on 9 and 2488 DF,  p-value: < 2.2e-16
##                       df      BIC
## mod5                   8 35200.87
## mod_int5               9 35185.35
## mod_lunghezza_nl       9 35116.40
## mod_int5_lunghezza_nl 11 35119.46

The extended model (mod_int5_lunghezza_nl) includes the quadratic effect of Lunghezza as well as interactions between Cranio and both Lunghezza and Lunghezza2, aiming to capture potential non-linear relationships and body proportionality effects. In this model, none of the length or head circumference terms is individually significant: the linear and quadratic terms for Lunghezza (p = 0.452 and p = 0.173), Cranio itself (p = 0.165), and the two interactions with Cranio (Lunghezza:Cranio, p = 0.708; Lunghezza2:Cranio, p = 0.985) all fail to reach conventional significance levels. Meanwhile, the previously significant predictors (Gestazione, Sesso, N.gravidanze) retain their expected positive effects.

Model comparison using BIC shows that adding these interaction terms does not meaningfully improve fit: the BIC of mod_int5_lunghezza_nl (35119.46) is slightly higher than the model including only the quadratic effect of Lunghezza (mod_lunghezza_nl, 35116.40) and lower than simpler linear models (mod5 = 35200.87, mod_int5 = 35185.35).

These results indicate that the non-linear effect of Lunghezza alone captures the dominant curvature in the data, while interactions with Cranio do not contribute additional explanatory power. For parsimony and interpretability, the model including only Lunghezza2 as a non-linear term is preferred.

An ANOVA comparing the linear model (mod5) with the model including the quadratic effect of Lunghezza (mod_lunghezza_nl) is carried out.

## Analysis of Variance Table
## 
## Model 1: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Lunghezza2 + 
##     Cranio + Sesso
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2491 187996688                                  
## 2   2490 181177870  1   6818817 93.714 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show a highly significant reduction in residual variance (F = 93.71, p < 2.2 × 10⁻¹⁶). This indicates that adding the quadratic term for Lunghezza significantly improves model fit.
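The F statistic can be recomputed directly from the residual sums of squares and degrees of freedom printed in the ANOVA table. The short sketch below does this arithmetic; the variable names are chosen for illustration only.

```r
# Values copied from the ANOVA table above
rss_linear    <- 187996688   # RSS of Model 1 (mod5)
rss_quadratic <- 181177870   # RSS of Model 2 (mod_lunghezza_nl)
df_linear     <- 2491        # residual df of Model 1
df_quadratic  <- 2490        # residual df of Model 2

# F = (drop in RSS per added parameter) / (residual mean square of larger model)
df_num  <- df_linear - df_quadratic
f_stat  <- ((rss_linear - rss_quadratic) / df_num) / (rss_quadratic / df_quadratic)
p_value <- pf(f_stat, df1 = df_num, df2 = df_quadratic, lower.tail = FALSE)

round(f_stat, 3)   # matches the reported F of about 93.71
```

Since only one parameter is added, this F test is equivalent to the t test on the Lunghezza2 coefficient in mod_lunghezza_nl.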

6.3 Final selection of the regression model

Based on the systematic exploration of all candidate models—including linear models, models with interactions, and models with non-linear terms—the model mod_lunghezza_nl is currently the preferred regression model. This model includes the predictors Anni.madre, N.gravidanze, Gestazione, Lunghezza, Lunghezza2, Cranio, and Sesso, with the quadratic term for Lunghezza capturing the non-linear relationship between body length and newborn weight.

The inclusion of the quadratic term significantly improves model fit, as supported by both the ANOVA test and the reduction in BIC compared to simpler linear models, making mod_lunghezza_nl the model with the best balance between fit and complexity among those tested.

Before confirming it as the final model, we will also evaluate its predictive accuracy using RMSE and assess multicollinearity to ensure stability and interpretability.

6.3.1 RMSE

RMSE (Root Mean Squared Error) is the square root of the mean squared prediction error. It summarizes the typical size of the prediction errors in the same units as the outcome variable (grams here). A lower RMSE indicates better predictive accuracy.

RMSE for mod5:

## [1] 274.3335

RMSE for mod_int5:

## [1] 273.0549

RMSE for mod_lunghezza_nl:

## [1] 269.3124

For the candidate models, mod5 has an RMSE of 274.33 g. Including the interaction term in mod_int5 slightly reduces the RMSE to 273.05 g, indicating a modest improvement. The model mod_lunghezza_nl, which incorporates the quadratic term for Lunghezza, achieves the lowest RMSE of 269.31 g.

This shows that accounting for the non-linear effect of newborn length improves prediction more than including the linear interaction term.
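The reported values are consistent with an in-sample RMSE, that is, sqrt(RSS / n) computed on the same data used for fitting. The sketch below defines a small helper (the name rmse is an assumption, not necessarily the function used in the analysis) and checks the arithmetic against the residual sums of squares from the ANOVA table, taking n = 2498 as inferred from the residual degrees of freedom (2491) plus the 7 coefficients of mod5.

```r
# Hypothetical helper: in-sample RMSE of a fitted lm object
rmse <- function(model) sqrt(mean(residuals(model)^2))

# Cross-check using the RSS values printed in the ANOVA table
n_obs     <- 2491 + 7                  # residual df + number of coefficients in mod5
rmse_mod5 <- sqrt(187996688 / n_obs)   # mod5
rmse_quad <- sqrt(181177870 / n_obs)   # mod_lunghezza_nl

round(c(mod5 = rmse_mod5, mod_lunghezza_nl = rmse_quad), 4)
```

Because this RMSE is computed on the training data, it can only decrease when terms are added; a hold-out or cross-validated RMSE would be a stricter check of predictive accuracy.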

6.3.2 Multicollinearity assessment

Variance Inflation Factors (VIF) were computed to evaluate multicollinearity in the model. Most predictors, including Anni.madre, N.gravidanze, Gestazione, Cranio, and Sesso, have very low VIF values (close to 1–2), indicating minimal correlation among them. As expected, Lunghezza and its quadratic term Lunghezza2 exhibit very high VIFs (> 200) due to the inherent correlation between a variable and its square.

This high VIF for Lunghezza and Lunghezza2 is a normal feature of polynomial models and does not compromise the overall stability or predictive accuracy of the model. The other predictors remain interpretable, and the quadratic term for Lunghezza captures the non-linear effect on newborn weight without introducing harmful multicollinearity.

## [1]   1.189330   1.186426   1.819041 238.038904 230.062510   1.630127   1.046128

For these reasons, mod_lunghezza_nl is selected as the optimal regression model, balancing predictive performance, interpretability, and biological plausibility.

The quadratic term for Lunghezza in mod_lunghezza_nl appears to capture the non-linear relationship between body length and birth weight more effectively than the linear interaction Cranio × Lunghezza. Biologically, this may reflect that as a newborn’s length increases, other body dimensions—such as chest, abdomen, and head circumference—also tend to grow, causing weight to increase faster than a simple linear trend. Including Lunghezza2 allows the model to better reflect this curvature, which is consistent with the lower RMSE and improved fit observed in both ANOVA and BIC comparisons.
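The high VIFs reported for Lunghezza and Lunghezza2 arise because a positive variable with a mean far from zero is almost perfectly correlated with its own square. The sketch below illustrates this on simulated lengths (the mean of 495 mm and standard deviation of 26 mm are assumptions for the example) and shows that centering the variable before squaring removes most of that correlation. With only two predictors the VIF reduces to 1 / (1 − r²), so the numbers differ from those of the full model.

```r
# Simulated newborn lengths (assumption: invented values, roughly in mm)
set.seed(1)
len <- rnorm(2500, mean = 495, sd = 26)

# With two predictors, VIF = 1 / (1 - r^2), where r is their correlation
vif_pair <- function(x, x_sq) 1 / (1 - cor(x, x_sq)^2)

# Raw variable and its square: nearly collinear
vif_raw <- vif_pair(len, len^2)

# Centered variable and its square: almost uncorrelated for symmetric data
len_c        <- len - mean(len)
vif_centered <- vif_pair(len_c, len_c^2)

round(c(raw = vif_raw, centered = vif_centered), 2)
```

Centering changes the interpretation of the linear coefficient but leaves fitted values and predictions unchanged, which is why the high VIF of a polynomial pair is usually regarded as harmless.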

6.4 Residuals analysis

After selecting the final regression model (mod_lunghezza_nl), it is essential to assess the model assumptions by examining the residuals. Residuals—the differences between observed and predicted values—provide insight into how well the model fits the data and whether the key assumptions of linear regression are satisfied.

In particular, the assumptions of linear regression are:

  • Linearity – The relationship between each predictor and the response is approximately linear.

  • Normality – Residuals should be normally distributed.

  • Homoscedasticity – Residuals should have constant variance across the entire range of predicted values.

  • Independence – Residuals should not be correlated with each other or with the predictor variables.

Additionally, it is important to check for extreme residuals that may disproportionately influence the regression estimates.

Below, these assumptions are explored graphically before being formally assessed through statistical tests.

A scatter plot of residuals versus fitted values shows no systematic patterns. Residuals are randomly scattered around zero, indicating that the linear model appropriately captures the trend.

The Q-Q (quantile-quantile) plot of residuals shows points roughly following the diagonal line. Minor deviations occur at the extremes, but overall, residuals appear approximately normally distributed.

The Scale-Location plot (square root of standardized residuals versus fitted values) stabilizes the variance and makes patterns more apparent. Here, the roughly horizontal band of points indicates homoscedasticity, while funnel- or cone-shaped patterns would suggest heteroscedasticity.

The Residuals vs. Leverage plot shows that point 1551 exceeds Cook’s distance thresholds, suggesting that it may represent a highly influential observation affecting the model disproportionately.

To complement the graphical evaluation, the assumptions of the final regression model mod_lunghezza_nl are also assessed using formal statistical tests:

  • Normality of residuals – Tested using the Shapiro-Wilk test:
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod_lunghezza_nl)
## W = 0.98573, p-value = 3.888e-15

The very low p-value indicates a statistically significant departure from normality, so the hypothesis that the residuals are normally distributed must be rejected. However, this is common with large sample sizes [1]. The earlier graphical inspection (Q-Q plot) suggests that the deviation is mostly due to extreme residuals, while the majority of residuals appear approximately normally distributed.

Moreover, the density curve of the residuals is fairly symmetric and centered around zero, although it shows a slightly longer right tail.

  • Homoscedasticity – Tested with the studentized Breusch-Pagan test:
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_lunghezza_nl
## BP = 129.06, df = 7, p-value < 2.2e-16

The significant p-value suggests the presence of heteroscedasticity, indicating that the residual variance may increase for certain ranges of predicted values. The hypothesis of homoscedasticity must be rejected.

  • Independence of residuals – Tested with the Durbin-Watson test:
## 
##  Durbin-Watson test
## 
## data:  mod_lunghezza_nl
## DW = 1.9468, p-value = 0.09152
## alternative hypothesis: true autocorrelation is greater than 0

The non-significant p-value indicates no evidence of positive autocorrelation, so residuals can be considered independent.

Overall, although mod_lunghezza_nl shows some deviations from normality and homoscedasticity, it satisfies the independence assumption and captures the main predictors effectively. Given its parsimony and interpretability, it remains the most suitable model for this dataset among those explored.
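To make the three diagnostics more concrete, the sketch below computes them with base R only on simulated data whose error variance grows with the predictor. The data and object names are assumptions for illustration; the outputs above appear to come from the usual packaged tests, whereas here the studentized Breusch-Pagan statistic is rebuilt by hand as n·R² from a regression of the squared residuals on the predictors, and only the Durbin-Watson statistic (not its p-value) is computed.

```r
# Simulated heteroscedastic data (assumption: invented for illustration)
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x
fit <- lm(y ~ x)
e   <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(e)

# Homoscedasticity: studentized (Koenker) Breusch-Pagan statistic,
# n * R^2 from regressing squared residuals on the predictors
e2      <- e^2
aux     <- lm(e2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Independence: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
dw_stat <- sum(diff(e)^2) / sum(e^2)

c(shapiro_p = sw$p.value, bp_p = bp_p, dw = dw_stat)
```

When heteroscedasticity is detected, as in the report, the coefficient estimates remain unbiased but the usual standard errors may be unreliable; heteroscedasticity-robust standard errors are one common remedy.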

6.5 Leverages analysis

Leverage measures how far an observation’s predictor values are from the mean of all predictor values. Observations with high leverage have the potential to exert strong influence on the regression line, even if their residuals are small.

To assess leverage in mod_int5, we computed the hat values for all observations and plotted them. A horizontal threshold line at 2p/n was added, where p is the number of predictors and n is the sample size.

A total of 133 observations lie above this threshold, indicating they have higher leverage than typical points. These observations warrant attention, as they could potentially influence the regression coefficients more strongly than the majority of the data.
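A minimal sketch of this leverage check is given below, using simulated data and base R. The data, the object names and the choice to count p as the number of estimated coefficients including the intercept are assumptions; the report's own count may have been obtained with a slightly different definition of p.

```r
# Simulated data (assumption: invented for illustration)
set.seed(2)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = sim)

# Hat values: diagonal of the projection matrix, one per observation
h <- hatvalues(fit)

# Threshold 2p/n, with p counted as coefficients including the intercept
p         <- length(coef(fit))
threshold <- 2 * p / n

# Observations whose leverage exceeds the threshold
high_leverage <- which(h > threshold)
length(high_leverage)
```

The hat values always sum to p, so their mean is p/n and the threshold simply flags points with more than twice the average leverage.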

6.6 Outliers analysis

Outliers are observations with unusually large residuals, which can disproportionately affect the regression estimates. To identify potential outliers, the studentized residuals are explored, which standardize residuals to account for their differing variances.

A scatter plot of the studentized residuals was created, with horizontal reference lines at -2 and 2 to highlight observations with unusually large residuals. Additionally, a formal Bonferroni-adjusted outlier test was performed using the car::outlierTest function.

##       rstudent unadjusted p-value Bonferroni p
## 1551 11.163329         2.8581e-28   7.1395e-25
## 155   5.116418         3.3514e-07   8.3719e-04
## 1306  4.791893         1.7493e-06   4.3698e-03

The analysis identified three highly significant outliers:

  • Observation 1551 (r_student = 11.16, Bonferroni p < 1e-24)

  • Observation 155 (r_student = 5.12, Bonferroni p = 8.37e-04)

  • Observation 1306 (r_student = 4.79, Bonferroni p = 4.37e-03)

These points exhibit exceptionally large residuals and should be carefully considered. While they may represent data errors, they could also reflect genuine extreme cases in the population.
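The logic of the Bonferroni-adjusted test can be sketched with base R. In the table above, each Bonferroni p-value equals the unadjusted p-value multiplied by the number of observations (for instance 2.8581e-28 × 2498 ≈ 7.14e-25, with n = 2498 inferred from the model's degrees of freedom). The example below reproduces that calculation on simulated data with one planted outlier; the data and names are assumptions, and the adjusted p-values are capped at 1 here for readability.

```r
# Simulated data with one planted outlier (assumption: invented for illustration)
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[50] <- y[50] + 8                 # shift observation 50 far from the line
fit <- lm(y ~ x)

# Externally studentized residuals follow a t distribution
# with (residual df - 1) degrees of freedom under the model
rs      <- rstudent(fit)
df_t    <- df.residual(fit) - 1
p_unadj <- 2 * pt(abs(rs), df = df_t, lower.tail = FALSE)

# Bonferroni adjustment: multiply by the number of observations tested
p_bonf  <- pmin(1, n * p_unadj)
flagged <- which(p_bonf < 0.05)
flagged
```

The adjustment accounts for the fact that the largest of n residuals is being tested, so a studentized residual slightly beyond ±2 is not flagged on its own.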

6.7 Cook’s distance analysis

Cook’s distance is a measure of the influence of each observation on the fitted regression coefficients. It takes into account both the leverage of the observation (its distance in the predictor space) and the size of its residual (difference between observed and predicted values), capturing the overall influence on the regression estimates. Observations with large Cook’s distance values can disproportionately affect the model and may warrant further investigation.

For mod_lunghezza_nl, Cook’s distances were computed for all observations, and a threshold of 0.5 was considered to identify potentially influential points. The maximum Cook’s distance is 1.267, corresponding to Observation 1551, which exceeds the threshold, indicating that at least one observation has substantial influence on the model.
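The following sketch shows how such a check could be carried out with base R on simulated data containing one high-leverage outlier. The data, object names and planted point are assumptions for illustration; the 0.5 threshold follows the text. The last lines rebuild Cook's distance from its two ingredients, the standardized residual and the leverage.

```r
# Simulated data with one high-leverage outlier (assumption: invented values)
set.seed(4)
n <- 100
x <- c(rnorm(n), 6)                 # last point lies far out in predictor space
y <- 1 + 2 * x + rnorm(n + 1)
y[n + 1] <- y[n + 1] - 15           # and far from the true regression line
fit <- lm(y ~ x)

# Cook's distance for every observation and the 0.5 threshold used in the text
d           <- cooks.distance(fit)
influential <- which(d > 0.5)

# Same quantity from its components: D = r^2 / p * h / (1 - h)
h        <- hatvalues(fit)          # leverage
r        <- rstandard(fit)          # internally studentized residuals
d_manual <- r^2 / length(coef(fit)) * h / (1 - h)

influential
```

A point needs both a sizeable residual and non-negligible leverage to reach a large Cook's distance, which is why an observation can be an outlier without being influential, and vice versa.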

The plot below displays Cook’s distances for all observations, with a horizontal dashed line at 0.5 marking the threshold:

7 Predictions

Once the final regression model (mod_lunghezza_nl) is selected, it can be used to estimate the expected neonatal weight (Peso) for hypothetical or real cases based on specific values of the predictors. Predictions are computed by inputting the chosen values for each predictor into the model equation, taking into account both the main effects and the quadratic term for Lunghezza.

Three illustrative examples were considered:

CASE 1) Female newborn, typical maternal and gestational characteristics

  • Anni.madre = 28, N.gravidanze = 3, Gestazione = 39, Lunghezza = 495, Cranio = 340, Sesso = “F”

  • Predicted weight: 3257 g

Anni.madre.predict     <- 28
N.gravidanze.predict   <- 3
Gestazione.predict     <- 39
Lunghezza.predict      <- 495
Cranio.predict         <- 340 
Sesso.predict          <- "F"


predict_df <- data.frame(
  Anni.madre  = Anni.madre.predict,
  N.gravidanze = N.gravidanze.predict,
  Gestazione  = Gestazione.predict,
  Lunghezza   = Lunghezza.predict,
  Lunghezza2  = Lunghezza.predict^2,  
  Cranio      = Cranio.predict,
  Sesso       = Sesso.predict
)

peso.predict <- round(predict(mod_lunghezza_nl, newdata = predict_df), 0)
##    1 
## 3257

CASE 2) Male newborn with same maternal and gestational characteristics

  • Sesso = “M” (other predictors identical to the first case)

  • Predicted weight: 3327 g

  • The difference reflects the strong positive effect of male sex (SessoM) on neonatal weight in the model.

Anni.madre.predict     <- 28
N.gravidanze.predict   <- 3
Gestazione.predict     <- 39
Lunghezza.predict      <- 495
Cranio.predict         <- 340
Sesso.predict          <- "M"

# Same predictor values as Case 1, with Sesso set to "M"
predict_df <- data.frame(
  Anni.madre   = Anni.madre.predict,
  N.gravidanze = N.gravidanze.predict,
  Gestazione   = Gestazione.predict,
  Lunghezza    = Lunghezza.predict,
  Lunghezza2   = Lunghezza.predict^2,
  Cranio       = Cranio.predict,
  Sesso        = Sesso.predict
)

peso.predict <- round(predict(mod_lunghezza_nl, newdata = predict_df), 0)
##    1 
## 3327

CASE 3) Female newborn with higher gestational age

  • Anni.madre = 28, N.gravidanze = 3, Gestazione = 41, Lunghezza = 495, Cranio = 340, Sesso = “F”

  • Predicted weight: 3343 g

  • Here, the increase in gestational age produces a higher predicted weight compared to the first example.

Anni.madre.predict     <- 28
N.gravidanze.predict   <- 3
Gestazione.predict     <- 41
Lunghezza.predict      <- 495
Cranio.predict         <- 340
Sesso.predict          <- "F"

# Same predictor values as Case 1, with Gestazione raised from 39 to 41 weeks
predict_df <- data.frame(
  Anni.madre   = Anni.madre.predict,
  N.gravidanze = N.gravidanze.predict,
  Gestazione   = Gestazione.predict,
  Lunghezza    = Lunghezza.predict,
  Lunghezza2   = Lunghezza.predict^2,
  Cranio       = Cranio.predict,
  Sesso        = Sesso.predict
)

peso.predict <- round(predict(mod_lunghezza_nl, newdata = predict_df), 0)
##    1 
## 3343
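In the three cases above, Lunghezza2 has to be recomputed by hand for each new data frame. One alternative, sketched below on simulated data, is to write the quadratic term as I(Lunghezza^2) inside the model formula so that predict() derives it automatically. The data, coefficients and the object names fit_manual and fit_inline are assumptions for illustration and are not part of the original analysis.

```r
# Simulated data (assumption: invented values, only the variable names follow the report)
set.seed(42)
n   <- 300
sim <- data.frame(
  Gestazione = sample(35:42, n, replace = TRUE),
  Lunghezza  = rnorm(n, mean = 495, sd = 25)
)
sim$Peso <- 3300 + 40 * (sim$Gestazione - 39) + 10 * (sim$Lunghezza - 495) +
  0.03 * (sim$Lunghezza - 495)^2 + rnorm(n, sd = 250)

# Approach used in the report: explicit squared column
sim$Lunghezza2 <- sim$Lunghezza^2
fit_manual <- lm(Peso ~ Gestazione + Lunghezza + Lunghezza2, data = sim)

# Alternative: quadratic term declared inside the formula
fit_inline <- lm(Peso ~ Gestazione + Lunghezza + I(Lunghezza^2), data = sim)

# Several cases can be predicted at once, one row per case
cases        <- data.frame(Gestazione = c(39, 41), Lunghezza = c(495, 495))
pred_inline  <- predict(fit_inline, newdata = cases)
pred_manual  <- predict(fit_manual,
                        newdata = transform(cases, Lunghezza2 = Lunghezza^2))

round(cbind(pred_inline, pred_manual), 0)
```

Both fits describe the same model and give the same predictions; the inline form mainly removes the risk of forgetting to update the squared column when a new case is entered.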

8 Graphical Exploration of Model Variables

Visualizations play a crucial role in complementing the statistical analysis performed in the previous sections. In this chapter, a series of plots is presented to illustrate how gestational time, anthropometric measures, and sex relate to newborn weight, and to visually confirm several of the trends identified through modelling.

In the plots below, the horizontal lines mark the median and the quartiles of newborn weight to support visual interpretation of the results.

8.1 Plot 1 - Newborn Weight vs Gestational Time by Sex

The first visualization explores the relationship between gestational age (Gestazione) and newborn weight (Peso), stratified by sex (Sesso). The scatter plot displays all individual observations, while linear smoothing curves highlight the trend separately for males and females and an overall trend for the entire sample.

Comments:

  • Weight increases with gestational age, showing the expected positive developmental trend.

  • Male newborns tend to weigh more than females at every gestational age, with two visibly distinct regression lines.

  • All newborns delivered at 33 weeks or earlier fall below the first quartile (Q1) of the weight distribution, indicating systematically lower weight due to prematurity.

  • Newborns delivered between 37 and 41 weeks—the typical full-term interval—show weights concentrated around the median, reflecting normal intrauterine growth.

  • Outliers are present, including full-term newborns (41–42 weeks) whose weight remains below Q1, suggesting atypically low growth despite prolonged gestation.

8.2 Plot 2 — Newborn Weight vs Gestational Time by Maternal Smoking Status

This second visualization examines whether maternal smoking (Fumatrici) is associated with differences in newborn weight across gestational ages.

Comments:

  • The number of observations is strongly unbalanced: smokers are far fewer than non-smokers, and their data do not span the full range of gestational ages.

  • As a consequence, the regression line for smokers is shorter and less reliable, while the overall regression line is driven mainly by the numerous non-smoker observations, nearly overlapping with the non-smoker trend.

  • Within the limited gestational range where smoker data are available, the slope of the smoker regression line is smaller, suggesting that among smokers, newborn weight increases more slowly with gestational age.

  • However, the linear model explored earlier in section 6 showed that the variable Fumatrici was not a statistically significant predictor, indicating that these graphical differences should be interpreted with caution and may not reflect a meaningful effect once other variables are accounted for.

8.3 Plot 3 — Newborn Weight vs Length

This plot explores the relationship between newborn length and weight, illustrating how body size at birth relates to overall mass.

Comments:

  • Weight increases with newborn length, showing a clear positive relationship.

  • The scatter plot appears approximately linear over the central range of lengths, although the regression analysis in section 6 detected a significant quadratic component.

  • Regression lines for males and females are close. At lower lengths, the female line lies slightly above the male line, but the two cross at around 420 mm because the male line has the steeper slope.

  • Most observations are concentrated around 450–540 mm in length, representing the bulk of the data.

8.4 Plot 4 — Newborn Weight vs Head Circumference

This plot explores the relationship between newborn head circumference and weight, illustrating how body size at birth relates to overall mass.

Comments:

  • Weight increases with newborn head circumference, indicating a strong positive association.

  • Most observations fall below the regression lines at low head circumference values.

  • Scatter is densest between 320–360 mm, where most weights are concentrated around the IQR.

  • Regression lines for males and females are clearly separated, with males consistently showing higher weights.

10 Appendix

This analysis was conducted using R version 4.5.1.