At Neonatal Health Solutions, our goal is to enhance the quality of prenatal and neonatal care through data-driven insights.
This analysis aims to develop a predictive model for newborn birth weight, leveraging clinical and demographic variables collected from three hospitals. Specifically, it seeks to:
Improve clinical predictions of neonatal outcomes to enable timely interventions.
Optimize hospital resources by anticipating which newborns may require intensive care.
Identify and prevent risk factors affecting birth weight and pregnancy outcomes.
Evaluate hospital practices to ensure consistent and high-quality care across centers.
Support strategic planning through data-driven insights that enhance neonatal health policies.
The project represents a step toward our broader mission: using statistical modeling to promote healthier births and more informed medical practices.
The dataset used for this analysis contains information on 2,500 newborns collected from three hospitals. It includes both maternal and neonatal clinical variables that may influence birth weight.
The variables include maternal age, number of pregnancies, maternal smoking status, gestational duration, newborn weight, length, and head circumference, type of delivery, hospital of birth, and sex of the newborn.
The following table shows the first few rows of the dataset, providing a quick glimpse of its structure.
Table 1. Dataset Preview

| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso |
|---|---|---|---|---|---|---|---|---|---|
| 26 | 0 | 0 | 42 | 3380 | 490 | 325 | Nat | osp3 | M |
| 21 | 2 | 0 | 39 | 3150 | 490 | 345 | Nat | osp1 | F |
| 34 | 3 | 0 | 38 | 3640 | 500 | 375 | Nat | osp2 | M |
| 28 | 1 | 0 | 41 | 3690 | 515 | 365 | Nat | osp2 | M |
| 20 | 0 | 0 | 38 | 3700 | 480 | 335 | Nat | osp3 | F |
| 32 | 0 | 0 | 40 | 3200 | 495 | 340 | Nat | osp2 | F |
Below is a detailed description of each variable, including its type and statistical classification.
Table 2. Variable Description

| Variable Name | Meaning | Classification |
|---|---|---|
| Anni.madre | Mother's age | Quantitative discrete |
| N.gravidanze | Number of previous pregnancies | Quantitative discrete |
| Fumatrici | Mother's smoking status (0 = no, 1 = yes) | Qualitative nominal |
| Gestazione | Weeks of gestation | Quantitative discrete |
| Peso | Newborn's weight in grams | Quantitative continuous |
| Lunghezza | Newborn's length in millimeters | Quantitative continuous |
| Cranio | Newborn's head circumference in millimeters | Quantitative continuous |
| Tipo.parto | Type of delivery (natural or cesarean) | Qualitative nominal |
| Ospedale | Hospital code | Qualitative nominal |
| Sesso | Newborn's sex (M/F) | Qualitative nominal |
This chapter summarizes the key statistical properties of both quantitative and qualitative variables in the dataset. For quantitative variables, measures of position, variability, and shape are calculated. For qualitative variables, the frequency distribution of observations across categories is examined.
The following table shows an overview of the quantitative variables in the dataset.
Table 3. Quantitative Variables in the Dataset

| Anni.madre | N.gravidanze | Gestazione | Peso | Lunghezza | Cranio |
|---|---|---|---|---|---|
| 26 | 0 | 42 | 3380 | 490 | 325 |
| 21 | 2 | 39 | 3150 | 490 | 345 |
| 34 | 3 | 38 | 3640 | 500 | 375 |
| 28 | 1 | 41 | 3690 | 515 | 365 |
| 20 | 0 | 38 | 3700 | 480 | 335 |
| 32 | 0 | 40 | 3200 | 495 | 340 |
Below is a summary of the main position indexes for the quantitative variables.
Table 4. Position Indexes

| Variable | Mean | Mode | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|
| Anni.madre | 28.16 | 30 | 0.00 | 25.00 | 28.00 | 32.00 | 46.00 |
| N.gravidanze | 0.98 | 0 | 0.00 | 0.00 | 1.00 | 1.00 | 12.00 |
| Gestazione | 38.98 | 40 | 25.00 | 38.00 | 39.00 | 40.00 | 43.00 |
| Peso | 3,284.08 | 3300 | 830.00 | 2,990.00 | 3,300.00 | 3,620.00 | 4,930.00 |
| Lunghezza | 494.69 | 500 | 310.00 | 480.00 | 500.00 | 510.00 | 565.00 |
| Cranio | 340.03 | 340 | 235.00 | 330.00 | 340.00 | 350.00 | 390.00 |
The summary highlights several points of interest.
The minimum maternal age is recorded as 0 years, which is clearly an error and will need correction. The mean maternal age is 28.16 years, while the most common age (mode) is 30 years. About 25% of mothers are 25 years old or less, while 75% are 32 years old or less. The oldest recorded mother is 46 years old.
The number of pregnancies ranges from 0 to 12, suggesting the presence of exceptional cases. The mode is 0, indicating that the single most common situation is a first pregnancy (not that this holds for the majority of mothers, since the median is 1). Additionally, at least 25% of women report no previous pregnancies, while at least 75% report one or fewer.
The gestational age has a mode of 40 weeks, which aligns with the typical full-term duration. Values below 37 weeks may indicate premature births, with the lowest gestational age recorded being 25 weeks.
The newborn’s weight ranges from 830 g to 4,930 g. The mean weight is 3,284 g, while the median is 3,300 g. The minimum value likely reflects a severely premature birth.
Length ranges from 310 mm to 565 mm, and head circumference from 235 mm to 390 mm, with the very low values probably reflecting premature births or other atypical cases.
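As a minimal sketch of how the position indexes in Table 4 could be computed in base R: the helper name `position_indexes` and the toy vector are illustrative assumptions, not the report's actual code. With the real data one would pass a column such as `df$Peso`.

```r
# Hypothetical helper: position indexes for one numeric vector.
position_indexes <- function(x) {
  x <- x[!is.na(x)]
  freq <- table(x)
  # Mode = most frequent value (the smallest one in case of ties).
  mode_value <- as.numeric(names(freq)[which.max(freq)])
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  c(Mean = mean(x), Mode = mode_value, Min = min(x),
    Q1 = q[1], Median = q[2], Q3 = q[3], Max = max(x))
}

# Toy example: the six weights shown in the dataset preview (Table 1).
peso_preview <- c(3380, 3150, 3640, 3690, 3700, 3200)
idx <- position_indexes(peso_preview)
print(round(idx, 2))
```

With only six distinct values the mode is not informative here; the example is meant only to show the mechanics.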
Below is a summary of the main variability indexes for the quantitative variables.
Table 5. Variability Indexes

| Variable | StdDev | Range | IQR | CV |
|---|---|---|---|---|
| Anni.madre | 5.27 | 46.00 | 7.00 | 18.72% |
| N.gravidanze | 1.28 | 12.00 | 1.00 | 130.51% |
| Gestazione | 1.87 | 18.00 | 2.00 | 4.79% |
| Peso | 525.04 | 4,100.00 | 630.00 | 15.99% |
| Lunghezza | 26.32 | 255.00 | 30.00 | 5.32% |
| Cranio | 16.43 | 155.00 | 20.00 | 4.83% |
The Coefficient of Variation (CV) provides a standardized measure of dispersion, allowing comparisons of variability across variables with different scales.
Among the variables, number of pregnancies (CV = 130.5%) shows by far the highest relative variability, reflecting a highly dispersed distribution affected by several extreme values. Maternal age (CV = 18.7%) and newborn weight (CV = 16.0%) display moderate variability, suggesting some heterogeneity within the sample.
In contrast, gestational age (CV = 4.8%), length (CV = 5.3%), and head circumference (CV = 4.8%) exhibit low relative variability, indicating that these measures are relatively less dispersed across observations.
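The coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. The sketch below recomputes it from the rounded means and standard deviations reported in Tables 4 and 5; the function name `cv_percent` is an illustrative assumption, and because the inputs are rounded the results may differ from Table 5 in the last digit.

```r
# Coefficient of variation as a percentage: sd / mean * 100.
cv_percent <- function(sd_value, mean_value) 100 * sd_value / mean_value

# Summary values copied from Table 4 (mean) and Table 5 (standard deviation).
summary_stats <- data.frame(
  variable = c("Anni.madre", "N.gravidanze", "Gestazione",
               "Peso", "Lunghezza", "Cranio"),
  mean = c(28.16, 0.98, 38.98, 3284.08, 494.69, 340.03),
  sd   = c(5.27, 1.28, 1.87, 525.04, 26.32, 16.43)
)
summary_stats$cv <- cv_percent(summary_stats$sd, summary_stats$mean)
print(summary_stats)
```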
Below is a summary of the main shape indexes for the quantitative variables.
Table 6. Shape Indexes

| Variable | Skewness | Kurtosis |
|---|---|---|
| Anni.madre | 0.04 | 0.38 |
| N.gravidanze | 2.51 | 10.99 |
| Gestazione | −2.07 | 8.26 |
| Peso | −0.65 | 2.03 |
| Lunghezza | −1.51 | 6.49 |
| Cranio | −0.79 | 2.95 |
The shape analysis highlights differences in symmetry and tail behavior among the variables.
Maternal age has near-zero skewness (0.04) and a slightly positive kurtosis (0.38), indicating an almost symmetric and mildly leptokurtic distribution.
Number of pregnancies shows strong positive skewness (2.51) and high positive kurtosis (10.99), suggesting a right-skewed and sharply leptokurtic distribution, with most women having few pregnancies and a small number with very high counts.
Gestational age displays negative skewness (−2.07) and high positive kurtosis (8.26), meaning it is left-skewed and leptokurtic, with most pregnancies concentrated around full term and a few markedly shorter durations.
Newborn weight, length, and head circumference all exhibit negative skewness (−0.65, −1.51, and −0.79 respectively) and positive kurtosis (2.03, 6.49, and 2.95 respectively), indicating asymmetric distributions with longer left tails and peaked shapes relative to the normal distribution.
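A minimal sketch of moment-based shape indexes in base R follows. It assumes that the kurtosis values in Table 6 are excess kurtosis (zero for a normal distribution), which is how the text interprets them; the report may have used a different estimator or package, so `shape_indexes` is only an illustration.

```r
# Moment-based shape indexes (population moments, no bias correction).
shape_indexes <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)
  c(skewness = mean(d^3) / m2^1.5,
    excess_kurtosis = mean(d^4) / m2^2 - 3)
}

# Toy data: a symmetric sample, and the same sample with one very low value added.
symmetric <- c(-2, -1, 0, 1, 2)
left_tailed <- c(symmetric, -10)
print(shape_indexes(symmetric))
print(shape_indexes(left_tailed))
```

Adding a single very low value produces negative skewness, the same pattern that a few markedly premature births induce in gestational age, weight, and length.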
Outlier detection is an essential step to identify unusual or extreme observations that may affect the accuracy of the statistical analysis and model estimation. These values can result from data entry errors, measurement inaccuracies, or genuine but rare clinical cases.
Below is a boxplot representation for all quantitative variables, where observations identified as outliers are highlighted in red. This visualization provides an overview of the variability within each variable and helps assess the presence and extent of extreme values in the dataset.
Below is a summary table reporting, for each quantitative variable, the number and values of detected outliers.
Table 7. Outliers summary

| Variable | Outliers_count | Outliers_values |
|---|---|---|
| Anni.madre | 13 | 0, 1, 13, 14, 43, 44, 45, 46 |
| N.gravidanze | 246 | 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 |
| Gestazione | 67 | 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 |
| Peso | 69 | 830, 900, 930, 980, 990, 1140, 1170, 1180, 1190, 1280, 1285, 1300, 1340, 1370, 1390, 1410, 1430, 1450, 1500, 1550, 1560, 1580, 1600, 1615, 1620, 1690, 1720, 1730, 1750, 1770, 1780, 1800, 1840, 1850, 1890, 1900, 1950, 1960, 1970, 1980, 2000, 2040, 4580, 4600, 4620, 4650, 4680, 4690, 4700, 4720, 4760, 4810, 4900, 4930 |
| Lunghezza | 59 | 310, 315, 320, 325, 340, 345, 355, 360, 370, 380, 385, 390, 400, 405, 410, 420, 425, 430, 560, 565 |
| Cranio | 48 | 235, 245, 253, 254, 265, 266, 267, 270, 272, 273, 274, 275, 276, 277, 278, 280, 285, 287, 289, 290, 292, 293, 294, 295, 297, 298, 299, 381, 382, 383, 384, 385, 386, 390 |
The table highlights several notable aspects. Maternal age shows 13 outliers, including implausible values such as 0 and 1, which most likely result from data entry errors.
Number of pregnancies presents a large number of outliers (246 cases), primarily because the first quartile is 0 and the third quartile is 1: with an interquartile range of 1, the usual boxplot upper fence is 1 + 1.5 × 1 = 2.5, so any value of 3 or above is statistically classified as an outlier.
Gestational age includes 67 outliers, all corresponding to early deliveries between 25 and 34 weeks.
For newborn weight, 69 outliers were identified, concentrated at both extremes: 2,040 g or less and 4,580 g or more. These observations likely reflect premature or macrosomic births rather than data errors.
Length and head circumference also display several outliers, particularly at the tails of their distributions, possibly related to the same clinical conditions influencing birth weight.
Overall, while some outliers appear to stem from recording inaccuracies, others represent genuine biological variability. Based on these considerations, maternal age values equal to 0 or 1 have been recoded as NA for data consistency.
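The outlier values in Table 7 are consistent with the standard boxplot rule, which flags observations beyond 1.5 interquartile ranges from the quartiles. The sketch below applies that rule to the quartiles in Table 4 and shows one way to recode implausible ages as missing; `tukey_fences` and the toy age vector are illustrative assumptions, not the report's code.

```r
# Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged (k = 1.5).
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from Table 4.
print(tukey_fences(q1 = 0, q3 = 1))        # N.gravidanze
print(tukey_fences(q1 = 38, q3 = 40))      # Gestazione
print(tukey_fences(q1 = 2990, q3 = 3620))  # Peso

# Recoding implausible maternal ages as missing (toy vector, not the real data).
anni_madre <- c(26, 0, 34, 1, 20)
anni_madre[anni_madre %in% c(0, 1)] <- NA
print(anni_madre)
```

On a full data frame, `boxplot.stats(x)$out` returns the flagged values directly using the same 1.5 coefficient by default.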
In this section, we examine the distribution of each quantitative variable through three complementary perspectives.
First, the frequency class distribution summarizes how observations are distributed across intervals, allowing the identification of potential asymmetries or clustering of values.
Next, the density plots provide a visual representation of the continuous distribution of each variable, helping to assess the overall shape and detect deviations from normality such as skewness or heavy tails.
Finally, the Shapiro–Wilk test formally evaluates the assumption of normality for each variable.
##
## Shapiro-Wilk normality test
##
## data: df$Peso
## W = 0.97066, p-value < 2.2e-16
The Shapiro–Wilk test for the variable Peso (newborn
weight) returned a p-value < 2.2×10⁻¹⁶, far below the 0.05
threshold. Although the null hypothesis of normality is statistically
rejected, this result should be interpreted with caution given the large
sample size. With 2,500 observations, even small deviations from
normality tend to yield significant p-values. The slight
departure from normality is also consistent with the mild negative
skewness observed earlier, likely due to a subset of low-birth-weight newborns.
##
## Shapiro-Wilk normality test
##
## data: df$Anni.madre
## W = 0.99491, p-value = 1.477e-07
##
## Shapiro-Wilk normality test
##
## data: df$Gestazione
## W = 0.83328, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: df$Lunghezza
## W = 0.90941, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: df$Cranio
## W = 0.96357, p-value < 2.2e-16
Overall, none of the quantitative variables tested passes the Shapiro–Wilk normality test. However, this result should be interpreted with caution, as the large number of observations can strongly influence the test outcome, making even small deviations from normality appear statistically significant. This limitation is well-documented in statistical literature [1].
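The sample-size sensitivity described above can be explored with simulated data. The sketch below draws values from a mildly left-skewed distribution (an assumption chosen to resemble the skewness of `Peso`, not the real data) and runs `shapiro.test` on the full sample and on a small subset.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 2500

# Simulated, mildly left-skewed "weights": a mirrored, rescaled chi-square.
x_large <- 3300 - 525 * (rchisq(n, df = 20) - 20) / sqrt(40)
x_small <- x_large[1:50]

# shapiro.test() accepts between 3 and 5000 observations.
sw_large <- shapiro.test(x_large)
sw_small <- shapiro.test(x_small)
print(c(n_2500 = sw_large$p.value, n_50 = sw_small$p.value))
```

The same departure from normality is typically flagged decisively with 2,500 observations and much less reliably with 50, which is why a Q-Q plot or density plot is a useful complement to the formal test.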
The figure below displays the relative frequencies of three of the
categorical variables in the dataset: Sesso (newborn’s
sex), Fumatrici (maternal smoking status), and
Ospedale (hospital ID). These distributions provide an
overview of how observations are divided among categories, offering
useful context for understanding group proportions.
There is a clear disparity, with non-smoking mothers representing over 96% of the sample. This suggests that smoking during pregnancy is relatively rare within this dataset.
The births appear to be almost evenly distributed among the three hospitals, with a slightly higher proportion recorded in Ospedale 3. This indicates that the data collection is reasonably balanced across institutions.
The distribution between male and female newborns is nearly balanced, indicating no substantial sex bias in the sample.
In this section, we conduct a series of statistical tests to validate specific hypotheses derived from the study objectives. In particular, we assess:
Whether the proportion of cesarean deliveries differs significantly across hospitals;
Whether the mean weight and length of newborns in this sample are statistically equal to those observed in the general population;
Whether anthropometric measures (weight, length, and head circumference) differ significantly between male and female newborns.
Each test is performed using the most appropriate statistical method, depending on the type and distribution of the variables involved.
The first hypothesis evaluates whether the three hospitals in the
dataset differ in the proportion of cesarean deliveries. To investigate
this, we examine the association between delivery type
(Tipo.parto) and hospital (Ospedale) using the
chi-square test of independence.
This statistical test assesses whether two categorical variables are independent—meaning that the distribution of one variable does not change across the levels of the other. In this context, it verifies whether the frequency of cesarean versus natural births is consistent across all hospitals, or whether certain hospitals perform cesarean deliveries more or less frequently than others.
The hypotheses of the chi-square test are:
Null hypothesis (H₀):
The variables Tipo.parto and Ospedale are
independent.
This means that the mode of delivery does not depend on the
hospital.
Alternative hypothesis (H₁):
The variables Tipo.parto and Ospedale are
not independent.
This implies that the mode of delivery does depend on the hospital.
The figure below shows how delivery types are distributed across the three hospitals.
From a visual inspection of the frequency distribution, it is clear
that natural delivery is by far the most frequent type of delivery in
each hospital. However, based solely on the frequency distribution, it
is not possible to determine whether there is a statistical dependency
between Tipo.parto and Ospedale.
Below are presented the results of the chi-square test.
##
## Pearson's Chi-squared test
##
## data: contingency_tab
## X-squared = 1.0972, df = 2, p-value = 0.5778
The obtained p-value is 0.5778. At a significance level of 5%, this
p-value is well above the threshold, so there is not sufficient evidence
to reject the null hypothesis. The data therefore provide no evidence of
an association between Tipo.parto and Ospedale: the proportion of
cesarean deliveries does not differ significantly across the three
hospitals.
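A minimal sketch of the chi-square test of independence in R is shown below. The data are simulated with hypothetical proportions and do not reproduce the report's contingency table; the last lines cross-check the reported p-value from the reported statistic (1.0972) and degrees of freedom (2).

```r
set.seed(42)  # arbitrary seed
n <- 2500

# Simulated deliveries (hypothetical proportions, not the report's data).
toy <- data.frame(
  Tipo.parto = sample(c("Nat", "Ces"), n, replace = TRUE, prob = c(0.7, 0.3)),
  Ospedale   = sample(c("osp1", "osp2", "osp3"), n, replace = TRUE)
)
contingency_tab <- table(toy$Tipo.parto, toy$Ospedale)
chi_res <- chisq.test(contingency_tab)
print(chi_res)

# Expected counts should be comfortably above 5 for the approximation to hold.
print(min(chi_res$expected))

# Cross-check: upper-tail chi-square probability for the reported statistic.
p_reported <- pchisq(1.0972, df = 2, lower.tail = FALSE)
print(round(p_reported, 4))
```

The degrees of freedom equal (rows − 1) × (columns − 1), which is 2 for a 2 × 3 table.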
This subchapter evaluates whether the mean weight (Peso)
and length (Lunghezza) of the newborns in the sample differ
significantly from the population averages. The one-sample
Student’s t-test is applied, which tests whether the mean of a
sample is statistically equal to a known population mean.
The t-test is suitable for variables that are approximately continuous and measured on an interval scale. It compares the observed sample mean with the hypothesized population mean while accounting for sample variability.
The reference population values are 3,300 g for weight (Peso) and 495 mm for length (Lunghezza).
The system of hypotheses for each variable is:
Null hypothesis (H₀):
The mean of the sample is equal to the population mean.
Alternative hypothesis (H₁):
The mean of the sample is different from the population mean.
Separate t-tests are conducted for Peso and
Lunghezza to statistically evaluate these hypotheses. Below
are the results.
##
## One Sample t-test
##
## data: df$Peso
## t = -1.516, df = 2499, p-value = 0.1296
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
## 3263.490 3304.672
## sample estimates:
## mean of x
## 3284.081
##
## One Sample t-test
##
## data: df$Lunghezza
## t = -0.58514, df = 2499, p-value = 0.5585
## alternative hypothesis: true mean is not equal to 495
## 95 percent confidence interval:
## 493.6598 495.7242
## sample estimates:
## mean of x
## 494.692
The one-sample t-tests indicate that the sample means for both
Peso and Lunghezza do not differ significantly
from the population averages.
For Peso, the sample mean is 3284.08 g, compared to the
population mean of 3300 g. The test yields a t-value of -1.516 and a
p-value of 0.1296. Since the p-value exceeds the common significance
level of 0.05, the null hypothesis cannot be rejected. The 95%
confidence interval (3263.49–3304.67 g) includes the population mean,
further confirming no significant difference.
For Lunghezza, the sample mean is 494.69 mm, with a
t-value of -0.585 and a p-value of 0.5585. Again, the null hypothesis
cannot be rejected. The 95% confidence interval (493.66–495.72 mm)
encompasses the population mean of 495 mm.
Overall, both tests support that the sample is consistent with the general population in terms of birth weight and length.
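The reported statistics can be checked by hand from the summary values, since the one-sample t statistic is the difference between the sample mean and the reference mean divided by the standard error. The helper `one_sample_t` below is an illustrative assumption; it uses the rounded standard deviations from Table 5, so small discrepancies in the last digit are expected.

```r
# One-sample t statistic from summary values: t = (xbar - mu0) / (s / sqrt(n)).
one_sample_t <- function(xbar, s, n, mu0, conf_level = 0.95) {
  se <- s / sqrt(n)
  t_stat <- (xbar - mu0) / se
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)          # two-sided p-value
  half_width <- qt(1 - (1 - conf_level) / 2, df = n - 1) * se
  list(t = t_stat, p = p_value, ci = c(xbar - half_width, xbar + half_width))
}

peso <- one_sample_t(xbar = 3284.081, s = 525.04, n = 2500, mu0 = 3300)
lunghezza <- one_sample_t(xbar = 494.692, s = 26.32, n = 2500, mu0 = 495)
print(peso)
print(lunghezza)
```

With the raw data, the equivalent call would be `t.test(df$Peso, mu = 3300)`.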
This analysis investigates whether there are statistically
significant differences in newborn anthropometric measures
(Peso, Lunghezza, and Cranio)
between male and female newborns.
The table below summarizes the means of the anthropometric measures by sex in the sample.

Table 8. Mean of Anthropometric Variables by Sex

| Sesso | Peso | Lunghezza | Cranio |
|---|---|---|---|
| F | 3,161.13 | 489.76 | 337.63 |
| M | 3,408.22 | 499.67 | 342.45 |
Independent two-sample t-tests, in Welch’s form, are applied to compare the means of each measure across the two groups.
The Welch test differs from the traditional Student’s t-test in that it does not assume equal variances between the two groups. This makes it more robust and reliable when the sample variances or sample sizes are unequal, a common situation in real datasets. R’s t.test function uses the Welch method by default, which provides more accurate Type I error control under variance heterogeneity.
For each variable, the hypothesis system is:
Null hypothesis (H₀):
The mean of the measure is statistically equal between males and
females.
Alternative hypothesis (H₁):
The mean of the measure statistically differs between males and
females.
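The difference between the Welch test and the pooled-variance Student test can be seen in a short sketch. The data are simulated with unequal group sizes and variances (hypothetical values, not the report's data); in `t.test`, the Welch version corresponds to the default `var.equal = FALSE`.

```r
set.seed(7)  # arbitrary seed

# Simulated weights by sex, with unequal group sizes and spreads.
toy <- data.frame(
  Sesso = rep(c("F", "M"), times = c(120, 80)),
  Peso  = c(rnorm(120, mean = 3160, sd = 450),
            rnorm(80,  mean = 3410, sd = 600))
)

welch  <- t.test(Peso ~ Sesso, data = toy)                    # Welch (default)
pooled <- t.test(Peso ~ Sesso, data = toy, var.equal = TRUE)  # classical Student

print(welch$method)
print(welch$parameter)   # fractional Welch-Satterthwaite degrees of freedom
print(pooled$parameter)  # n1 + n2 - 2
```

The fractional degrees of freedom in the outputs that follow (for example 2490.7) are the signature of the Welch approximation.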
Starting with the anthropometric variable of newborns’ weight, the scatter plot below displays the distribution of weights by sex. From a visual perspective, the plot suggests that males tend to have higher weights than females.
To assess whether this apparent difference is statistically significant, a two-sample t-test is performed to evaluate whether the mean weight is the same for both groups or whether a true difference exists.
##
## Welch Two Sample t-test
##
## data: df$Peso[df$Sesso == "M"] and df$Peso[df$Sesso == "F"]
## t = 12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 207.0615 287.1051
## sample estimates:
## mean of x mean of y
## 3408.215 3161.132
The Welch two-sample t-test for the variable Peso
(newborn’s weight) shows a statistically significant difference between
male and female infants (p-value < 2.2e-16). The 95% confidence
interval for the difference in mean weights ranges from 207.1 g to 287.1
g. Because this interval does not include 0, a difference of zero is not
plausible given the sample data. Therefore, the test strongly
supports the conclusion that male newborns are, on average, heavier than
female newborns.
The analysis then turns to newborns’ length. The boxplot below shows the distribution by sex and again suggests that males tend to have higher values than females.
##
## Welch Two Sample t-test
##
## data: df$Lunghezza[df$Sesso == "M"] and df$Lunghezza[df$Sesso == "F"]
## t = 9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.876273 11.929470
## sample estimates:
## mean of x mean of y
## 499.6672 489.7643
The Welch two-sample t-test for Lunghezza (newborn’s
length) reveals a significant difference in mean length between male and
female newborns. Males are, on average, longer (499.7 mm) compared to
females (489.8 mm). The p-value (< 2.2e-16) confirms that the
difference is highly statistically significant. The 95% confidence
interval for the mean difference, ranging from 7.88 mm to 11.93 mm,
further supports the presence of a consistent sex-related effect on
newborn length. Because the interval excludes 0, the null hypothesis of
equal means is rejected.
Finally, the analysis focuses on newborn’s head circumference, with the boxplot suggesting once again that males tend to have higher values.
##
## Welch Two Sample t-test
##
## data: df$Cranio[df$Sesso == "M"] and df$Cranio[df$Sesso == "F"]
## t = 7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.541270 6.089912
## sample estimates:
## mean of x mean of y
## 342.4486 337.6330
For Cranio (newborn’s head circumference), the Welch
two-sample t-test indicates that male newborns have larger head
circumferences (342.4 mm) than females (337.6 mm). The p-value
(1.718e-13) is well below the 5% threshold, and the 95% confidence
interval for the difference in head circumference, from 3.54 mm to
6.09 mm, excludes 0. The null hypothesis of equal means between males
and females is therefore rejected. The difference is statistically
significant, although modest in absolute size (about 4.8 mm).
Overall, all statistical tests conducted on the anthropometric variables indicate significant differences in mean values between male and female newborns. This confirms that sex is an important factor and should be included as a relevant predictor in the construction of the linear regression model.
The aim of this chapter is to develop a statistical model capable of
predicting the newborn’s weight (Peso) using the clinical
and demographic variables available in the dataset.
To build an effective and interpretable predictive model, a stepwise regression strategy will be adopted. This approach allows variables to be added or removed based on their statistical contribution, leading to a parsimonious model that retains only the most informative predictors.
Before implementing the stepwise procedure, it is essential to examine the relationships among the quantitative variables. In particular, correlation analysis provides an initial screening of how strongly each predictor is associated with the newborn’s weight, while also revealing potential issues such as multicollinearity.
The pairwise correlation matrix below reports both the correlation coefficients and the corresponding scatter plots for all quantitative variables.
A first observation concerns the anthropometric measures. As
expected, the strongest correlations with newborn weight
(Peso) are found for Lunghezza (0.80) and
Cranio (0.70). These strong positive associations are also
visually apparent in the scatter plots, which show a clear upward linear
trend. The high correlation with Gestazione (0.63) further
indicates that weight tends to increase with gestational age.
In contrast, the correlations between Peso and
Anni.madre as well as between Peso and
N.gravidanze are close to zero. The corresponding scatter
plots display widely dispersed points with nearly horizontal fitted
lines, suggesting no meaningful linear relationship between these
variables and newborn weight.
The analysis also highlights substantial positive correlations among
the predictors—notably between Lunghezza and
Cranio (0.60) and between Lunghezza and
Gestazione (0.62). These associations indicate that
multicollinearity may be present among the anthropometric and
gestational variables, and therefore must be checked in the subsequent
regression modeling phase.
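One common way to check multicollinearity is the variance inflation factor (VIF), computed for each predictor as 1 / (1 − R²) from a regression of that predictor on the others. The sketch below uses simulated predictors with built-in dependence; the coefficients are illustrative assumptions and the report may rely on a package function instead.

```r
set.seed(3)  # arbitrary seed
n <- 500

# Simulated predictors with built-in dependence (illustrative only).
Gestazione <- rnorm(n, 39, 1.9)
Lunghezza  <- 495 + 8 * (Gestazione - 39) + rnorm(n, 0, 20)
Cranio     <- 340 + 0.35 * (Lunghezza - 495) + rnorm(n, 0, 13)
toy <- data.frame(Gestazione, Lunghezza, Cranio)

print(round(cor(toy, use = "pairwise.complete.obs"), 2))

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others.
vif_manual <- sapply(names(toy), function(v) {
  f <- reformulate(setdiff(names(toy), v), response = v)
  1 / (1 - summary(lm(f, data = toy))$r.squared)
})
print(round(vif_manual, 2))
```

Values close to 1 indicate little shared variance; thresholds of 5 or 10 are often quoted as rules of thumb for problematic collinearity.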
For categorical variables, calculating correlation with
Peso is not appropriate. Instead, their relationship with
newborn weight will be explored visually using boxplots for each
categorical variable.
The first boxplot displays newborn’s weight (Peso) by
sex (Sesso). As confirmed by the t-test in Section 5.3,
there is a statistically significant difference in mean weight between
male and female newborns, with the test results leading to the rejection
of the null hypothesis.
Below is the boxplot of Peso by mother’s smoking status
(Fumatrici). Visually, no substantial difference in newborn
weight is apparent between smokers and non-smokers. This observation is
supported by an independent two-sample t-test, which yields a p-value of
0.3033—well above the 5% significance threshold—indicating that the null
hypothesis of equal means cannot be rejected. Therefore, there is no
statistically significant difference in Peso based on
maternal smoking status, although this relationship will be further
evaluated in the regression modeling phase.
##
## Welch Two Sample t-test
##
## data: Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group Non-smoker and group Smoker is not equal to 0
## 95 percent confidence interval:
## -45.61354 145.22674
## sample estimates:
## mean in group Non-smoker mean in group Smoker
## 3286.153 3236.346
Finally, the boxplot of Peso by Ospedale is
examined. Visually, the boxplots do not indicate any substantial
differences in newborn weight across the three hospitals.
To construct an effective predictive model for newborn weight
(Peso), a stepwise regression approach is employed. This
iterative procedure systematically adds or removes predictor variables
based on their statistical contribution to the model, typically measured
via criteria such as the Akaike Information Criterion (AIC) or Bayesian
Information Criterion (BIC).
The stepwise method balances model complexity and predictive accuracy, helping to identify a parsimonious set of predictors while avoiding overfitting. Both main effects and potential interaction terms are considered, ensuring that relevant relationships between variables are captured in the final model.
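For reference, an automated version of this procedure is available in base R through `step()`. The sketch below runs backward elimination on simulated data (hypothetical coefficients, with `Anni.madre` generated as pure noise); it is shown as an alternative illustration, not as the procedure used to obtain the models in this report.

```r
set.seed(11)  # arbitrary seed
n <- 300

toy <- data.frame(
  Gestazione = rnorm(n, 39, 1.9),
  Lunghezza  = rnorm(n, 495, 26),
  Anni.madre = rnorm(n, 28, 5)   # unrelated to Peso in this simulation
)
toy$Peso <- -3000 + 60 * toy$Gestazione + 8 * toy$Lunghezza + rnorm(n, 0, 270)

full <- lm(Peso ~ ., data = toy)

# Backward elimination: k = 2 gives the AIC penalty, k = log(n) a BIC-type penalty.
by_aic <- step(full, direction = "backward", trace = 0)
by_bic <- step(full, direction = "backward", k = log(nrow(toy)), trace = 0)
print(formula(by_aic))
print(formula(by_bic))
```

The BIC-type penalty is stricter for any sample larger than about 8 observations, so it tends to select smaller models than AIC.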
Below, the initial regression model is presented, representing the
complete model that includes all independent variables (predictors).
This model will be referred to as
mod1.
##
## Call:
## lm(formula = Peso ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1123.26 -181.53 -14.45 161.05 2611.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6735.7960 141.4790 -47.610 < 2e-16 ***
## Anni.madre 0.8018 1.1467 0.699 0.4845
## N.gravidanze 11.3812 4.6686 2.438 0.0148 *
## FumatriciSmoker -30.2741 27.5492 -1.099 0.2719
## Gestazione 32.5773 3.8208 8.526 < 2e-16 ***
## Lunghezza 10.2922 0.3009 34.207 < 2e-16 ***
## Cranio 10.4722 0.4263 24.567 < 2e-16 ***
## Tipo.partoNat 29.6335 12.0905 2.451 0.0143 *
## Ospedaleosp2 -11.0912 13.4471 -0.825 0.4096
## Ospedaleosp3 28.2495 13.5054 2.092 0.0366 *
## SessoM 77.5723 11.1865 6.934 5.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2487 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 668.7 on 10 and 2487 DF, p-value: < 2.2e-16
The initial regression model (mod1) includes all
independent variables to predict newborn weight (Peso). The
model explains a substantial portion of the variability in weight, with
an adjusted R-squared of 0.7278, indicating that approximately 73% of
the variation in Peso is accounted for by the
predictors.
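The adjusted R-squared can be verified from the multiple R-squared, the number of observations, and the number of predictors. In the sketch below, n = 2498 is inferred from the 2487 residual degrees of freedom plus the 11 estimated coefficients; the function name `adjusted_r2` is an illustrative assumption.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of predictors (excluding the intercept).
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# mod1: R^2 = 0.7289, 10 predictors, 2487 residual df -> n = 2487 + 11 = 2498.
mod1_adj <- adjusted_r2(r2 = 0.7289, n = 2498, p = 10)
print(round(mod1_adj, 4))
```

Because n is large relative to p, the adjustment is small and the two R-squared values differ only in the third decimal place.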
Among the predictors, Gestazione,
Lunghezza, Cranio, and SessoM
show very strong statistical significance (p < 0.001), indicating
that longer gestation, larger anthropometric measures, and male sex are
associated with higher birth weight. N.gravidanze,
Tipo.partoNat, and Ospedaleosp3 are also
significant at the 5% level, though their effect sizes are smaller.
Anni.madre, FumatriciSmoker, and
Ospedaleosp2 are not statistically significant.
Now, a revised regression model (mod2)
will be explored, removing the categorical variable
Fumatrici due to its lack of statistical significance in
mod1.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1122.63 -181.96 -14.91 161.39 2615.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6735.4444 141.4845 -47.606 < 2e-16 ***
## Anni.madre 0.8118 1.1467 0.708 0.4791
## N.gravidanze 11.1201 4.6627 2.385 0.0172 *
## Gestazione 32.3210 3.8138 8.475 < 2e-16 ***
## Lunghezza 10.3064 0.3006 34.285 < 2e-16 ***
## Cranio 10.4766 0.4263 24.577 < 2e-16 ***
## Tipo.partoNat 29.3770 12.0888 2.430 0.0152 *
## Ospedaleosp2 -11.0363 13.4475 -0.821 0.4119
## Ospedaleosp3 28.5194 13.5038 2.112 0.0348 *
## SessoM 77.3928 11.1858 6.919 5.77e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2488 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7278
## F-statistic: 742.8 on 9 and 2488 DF, p-value: < 2.2e-16
The overall fit of mod2 remains essentially unchanged,
with an adjusted R-squared of 0.7278, indicating that the explanatory
power is maintained.
All previously significant predictors—Gestazione,
Lunghezza, Cranio, SessoM,
N.gravidanze, Tipo.partoNat, and
Ospedaleosp3—retain their significance and similar effect
sizes. Non-significant variables remain Anni.madre and
Ospedaleosp2. It is therefore statistically appropriate to
remove Fumatrici.
Next, mod3 is estimated, this time also
excluding the variable Anni.madre, which shows no
statistical significance in either mod1 (p = 0.4845) or
mod2 (p = 0.4791).
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1113.18 -181.16 -16.58 161.01 2620.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6707.4293 135.9438 -49.340 < 2e-16 ***
## N.gravidanze 12.3619 4.3325 2.853 0.00436 **
## Gestazione 31.9909 3.7896 8.442 < 2e-16 ***
## Lunghezza 10.3086 0.3004 34.316 < 2e-16 ***
## Cranio 10.4922 0.4254 24.661 < 2e-16 ***
## Tipo.partoNat 29.2803 12.0817 2.424 0.01544 *
## Ospedaleosp2 -11.0227 13.4363 -0.820 0.41209
## Ospedaleosp3 28.6408 13.4886 2.123 0.03382 *
## SessoM 77.4412 11.1756 6.930 5.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.9 on 2491 degrees of freedom
## Multiple R-squared: 0.7287, Adjusted R-squared: 0.7278
## F-statistic: 836.3 on 8 and 2491 DF, p-value: < 2.2e-16
The performance of the model remains essentially unchanged, with an
adjusted R-squared of 0.7278, confirming that excluding
Anni.madre does not reduce the model’s explanatory
capacity.
All remaining predictors behave consistently with previous results.
The variables Gestazione, Lunghezza,
Cranio, and Sesso (specifically the level
SessoM) remain highly significant and continue to represent
the strongest determinants of Peso. The variable
N.gravidanze becomes slightly more influential (Estimate =
12.36) and is statistically significant. Among the categorical
predictors, Tipo.parto (level Tipo.partoNat)
and Ospedale (level Ospedaleosp3) also retain
their significance, while Ospedaleosp2 remains
non-significant.
Overall, the exclusion of Anni.madre seems to lead to a
more parsimonious model without any loss of predictive performance.
To further support the model selection, the AIC and BIC values of the
regression models estimated so far (mod1,
mod2, and mod3) are compared. Both criteria
reward good model fit while penalizing unnecessary complexity.
## df BIC
## mod1 12 35215.45
## mod2 11 35208.84
## mod3 10 35228.03
## df AIC
## mod1 12 35145.57
## mod2 11 35144.78
## mod3 10 35169.79
The results show that mod2 achieves the lowest AIC
(35144.78) and the lowest BIC (35208.84), making it the best-performing
model among the three.
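The information criteria reported above can be reproduced by hand for a Gaussian linear model. The sketch below is a minimal illustration, assuming the residual sum of squares of mod2 (186833870, as printed in the ANOVA table later in this analysis), n = 2498 observations (2488 residual degrees of freedom plus 10 coefficients), and k = 11 estimated parameters (10 coefficients plus the error variance, matching the df column). The helper name ic_from_rss is hypothetical.

```r
# Hypothetical helper: AIC and BIC of a Gaussian linear model from its RSS.
# rss = residual sum of squares, n = observations used, k = parameters
# counted as in the df column (coefficients + 1 for the error variance).
ic_from_rss <- function(rss, n, k) {
  minus2loglik <- n * (log(2 * pi) + 1 + log(rss / n))
  c(AIC = minus2loglik + 2 * k,
    BIC = minus2loglik + log(n) * k)
}

# Values read from the mod2 output: RSS = 186833870, n = 2488 + 10, k = 11
ic_mod2 <- ic_from_rss(rss = 186833870, n = 2498, k = 11)
print(round(ic_mod2, 2))
```

Under these assumptions the result agrees with the reported AIC and BIC of mod2 up to rounding. The two criteria differ only in the penalty per parameter: 2 for AIC versus log(n), about 7.8 here, for BIC.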
Although mod3 is more parsimonious (one fewer
predictor), it scores worse on both AIC (35169.79) and BIC
(35228.03). This comparison needs care, however: mod3 was fitted to
all 2,500 observations (2,491 residual degrees of freedom plus 9
coefficients), whereas mod2 drops the 2 observations with missing
values and uses 2,498. Both criteria increase with the number of
observations, so the higher values for mod3 reflect, at least in part,
the larger sample rather than a worse fit. The adjusted R-squared is
identical (0.7278) in mod2 and mod3.
Overall, Anni.madre is retained in the model. The main reason is that
maternal age is a well-established clinical control variable in neonatal
health research, and including it helps ensure that the model appropriately
accounts for known demographic influences. The AIC and BIC comparison
alone does not settle the question, because the models were not fitted
to the same observations.
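Because AIC and BIC depend on the number of observations, they are comparable only across models fitted to the same rows. A quick check, sketched below with values read from the summaries above, recovers each model's sample size as residual degrees of freedom plus number of estimated coefficients. The vector names are illustrative.

```r
# Residual df and number of coefficients (intercept included), as printed
# in the summaries of mod2 and mod3 above.
resid_df <- c(mod2 = 2488, mod3 = 2491)
n_coef   <- c(mod2 = 10,   mod3 = 9)

# Observations actually used by each fit
n_used <- resid_df + n_coef
print(n_used)

# TRUE only if every model was fitted to the same number of rows
same_n <- length(unique(n_used)) == 1
print(same_n)
```

Here the check returns FALSE: mod3 uses 2,500 rows while mod2 uses 2,498, which is consistent with two rows being dropped for missing values whenever Anni.madre is in the formula. If the fitted objects are available, nobs(mod2) and nobs(mod3) give the same information directly.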
The next model, mod4, evaluates the
effect of removing the categorical variable Ospedale. In
the previous mod2, only the level Ospedaleosp3
showed statistical significance at the 5% level (p = 0.0348), while
Ospedaleosp2 was clearly non-significant (p = 0.4119).
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1139.50 -181.60 -14.59 160.14 2633.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6737.9269 141.5747 -47.593 < 2e-16 ***
## Anni.madre 0.8793 1.1479 0.766 0.4438
## N.gravidanze 11.4176 4.6676 2.446 0.0145 *
## Gestazione 32.6300 3.8180 8.546 < 2e-16 ***
## Lunghezza 10.2839 0.3009 34.176 < 2e-16 ***
## Cranio 10.4896 0.4268 24.574 < 2e-16 ***
## Tipo.partoNat 30.1222 12.1038 2.489 0.0129 *
## SessoM 77.8374 11.2008 6.949 4.67e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.4 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7278, Adjusted R-squared: 0.727
## F-statistic: 950.9 on 7 and 2490 DF, p-value: < 2.2e-16
After removing the categorical variable Ospedale, the
new model (mod4) maintains essentially the same explanatory
power, with an adjusted R-squared of 0.7270, nearly identical to that of
mod2 (0.7278). All key predictors—Gestazione,
Lunghezza, Cranio, SessoM,
N.gravidanze, and Tipo.partoNat—remain
significant and stable in magnitude, suggesting that the hospital
variable contributes little to the prediction of
Peso.
The Bayesian Information Criterion (BIC) will be used from this point onward as the primary metric for model comparison. BIC imposes a stronger penalty for model complexity than AIC and therefore tends to favor simpler, more interpretable and parsimonious models.
## df BIC
## mod1 12 35215.45
## mod2 11 35208.84
## mod3 10 35228.03
## mod4 9 35202.49
Using BIC as the criterion, mod4 is the
best-performing model among those examined so far. Its BIC value
(35202.49) is the lowest, indicating the most favorable trade-off
between goodness of fit and model simplicity. Dropping Ospedale
removes two dummy coefficients, and the resulting reduction in the
complexity penalty outweighs the small loss of fit, which suggests
that Ospedale adds little to the prediction of neonatal weight.
To further justify the removal of the variable Ospedale,
an ANOVA was performed to compare the nested models mod2
(including Ospedale) and mod4 (excluding
Ospedale). The test evaluates whether removing the hospital
variable significantly worsens the model fit.
## Analysis of Variance Table
##
## Model 1: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2488 186833870
## 2 2490 187530244 -2 -696373 4.6367 0.009774 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show an F-statistic of 4.637 with a p-value of 0.0098.
This indicates that, in strictly statistical terms, the simpler model
(mod4) without Ospedale fits significantly worse
than the more complex model (mod2). However, the practical
impact is minimal: the adjusted R-squared is virtually
unchanged (0.7270 versus 0.7278), and the BIC is lower for mod4,
reflecting a more parsimonious model.
Considering that Ospedale likely captures structural or
demographic differences rather than a direct biological effect on
neonatal weight, and that BIC favors simpler models, it is reasonable to
prefer mod4.
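The F-statistic in the ANOVA table can be traced back to the two residual sums of squares. The following sketch recomputes it from the printed values; the variable names are illustrative, and the numbers are copied from the table above.

```r
# Values read from the ANOVA table above
rss_full    <- 186833870  # mod2, with Ospedale
rss_reduced <- 187530244  # mod4, without Ospedale
df_full     <- 2488       # residual df of mod2
df_diff     <- 2          # parameters removed (osp2 and osp3 dummies)

# Partial F statistic: extra RSS per removed parameter,
# relative to the residual variance of the larger model
f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_val  <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

The test asks whether the increase in unexplained variation is larger than would be expected from removing two uninformative coefficients. With about 2,500 observations even a small increase can be detected, which is why the result is weighed against BIC and adjusted R-squared here.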
Having found that removing the variable Ospedale
improves model parsimony with only a negligible loss of explanatory
power, a new model, mod5, is then explored, in which the
categorical variable Tipo.parto is also excluded.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1159.84 -181.98 -15.04 164.16 2634.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6712.0654 141.3399 -47.489 < 2e-16 ***
## Anni.madre 0.8909 1.1491 0.775 0.4382
## N.gravidanze 11.1203 4.6710 2.381 0.0174 *
## Gestazione 32.6914 3.8219 8.554 < 2e-16 ***
## Lunghezza 10.2461 0.3008 34.058 < 2e-16 ***
## Cranio 10.5239 0.4271 24.642 < 2e-16 ***
## SessoM 77.8998 11.2125 6.948 4.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2491 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7271, Adjusted R-squared: 0.7264
## F-statistic: 1106 on 6 and 2491 DF, p-value: < 2.2e-16
The performance of mod5 remains very similar to the
previous specifications: the adjusted R-squared decreases only minimally
(from 0.7270 in mod4 to 0.7264 in mod5),
indicating that the explanatory power of the model is essentially
preserved.
All key biological predictors—Gestazione,
Lunghezza, Cranio, and
SessoM—remain highly significant, as well as
N.gravidanze, confirming their central role in determining
neonatal weight, while Anni.madre remains non-significant
but is retained due to its clinical relevance as a control variable.
## df BIC
## mod1 12 35215.45
## mod2 11 35208.84
## mod3 10 35228.03
## mod4 9 35202.49
## mod5 8 35200.87
Moreover, the comparison of BIC values confirms that
mod5 achieves the best overall performance among all models
without interactions. With the lowest BIC (35200.87), mod5
is preferred because the criterion explicitly rewards parsimony, and
this model attains a simpler specification—using fewer predictors—while
maintaining virtually the same adjusted R² as the more complex
alternatives. This strengthens the justification for selecting
mod5 as the most appropriate baseline model before
introducing interaction terms and non-linear terms.
Up to this point, several regression models without interaction terms
have been estimated and compared, progressively removing non-significant
predictors while monitoring model performance using the BIC criterion.
Among all specifications tested so far, the preferred model without
interactions is mod5, which includes the following
predictors: Anni.madre, N.gravidanze,
Gestazione, Lunghezza, Cranio,
and Sesso.
Having established a baseline model with main effects only, the next step is to investigate whether interaction effects may further improve the model. Interactions help capture situations in which predictors do not act independently but jointly influence the outcome.
In the following section, potential interactions between clinically or statistically relevant variables will be explored to assess whether they enhance the explanatory accuracy of the model.
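As a minimal sketch of what adding an interaction involves in R, the example below fits a main-effects model and then extends it with update(). The data are simulated purely for illustration (they are not the study data); only the column names mirror the dataset, and the object names toy, base_fit, and int_fit are hypothetical.

```r
set.seed(42)

# Simulated toy data (NOT the study data); column names mirror the dataset
n <- 300
toy <- data.frame(
  Gestazione = round(rnorm(n, mean = 39, sd = 2)),
  Sesso      = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$Peso <- 3300 + 120 * (toy$Gestazione - 39) +
  80 * (toy$Sesso == "M") + rnorm(n, sd = 250)

# Main-effects model, then the same model plus the interaction term
base_fit <- lm(Peso ~ Gestazione + Sesso, data = toy)
int_fit  <- update(base_fit, . ~ . + Gestazione:Sesso)

# The interaction adds exactly one coefficient (the slope difference for M)
print(names(coef(int_fit)))

# Lower BIC is better; the interaction is kept only if it lowers BIC
print(BIC(base_fit, int_fit))
```

The same pattern applies to each interaction examined in this section: the baseline model is extended by one term, and the extended model is retained only if the comparison supports it.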
The first interaction investigated is between Gestazione
and Sesso, producing model
mod_int1. The aim is to evaluate whether
the effect of gestational age on neonatal weight differs between males
and females.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Sesso + Gestazione:Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1155.27 -181.13 -14.24 162.85 2632.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6603.2349 171.4450 -38.515 < 2e-16 ***
## Anni.madre 0.9014 1.1491 0.784 0.433
## N.gravidanze 11.0587 4.6710 2.368 0.018 *
## Gestazione 29.8161 4.6021 6.479 1.11e-10 ***
## Lunghezza 10.2501 0.3008 34.071 < 2e-16 ***
## Cranio 10.5250 0.4270 24.646 < 2e-16 ***
## SessoM -185.2394 234.9202 -0.789 0.430
## Gestazione:SessoM 6.7431 6.0131 1.121 0.262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7265
## F-statistic: 948.3 on 7 and 2490 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int1 9 35207.43
The results show that the interaction term
Gestazione x SessoM is not statistically
significant (p = 0.262). Allowing different gestational-age slopes
for males and females therefore does not
meaningfully improve the model.
The adjusted R-squared (0.7265) is virtually unchanged
compared to the preferred model without interactions (mod5,
adjusted R² = 0.7264), the residual standard error stays at 274.7,
and the BIC increases from 35200.87 to 35207.43. Adding this
interaction does not enhance the explanatory power of the model.
Given the lack of statistical and practical contribution, the
interaction term Gestazione x Sesso does not appear
justified and will not be retained.
The second interaction explored is between Gestazione
and maternal smoking status (Fumatrici). This model, named
mod_int2, assesses whether the effect of
gestational age on neonatal weight differs between smokers and
non-smokers.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Sesso + Fumatrici + Gestazione:Fumatrici,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1159.92 -182.47 -16.54 164.21 2630.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6728.7262 142.1541 -47.334 < 2e-16 ***
## Anni.madre 0.8587 1.1492 0.747 0.4550
## N.gravidanze 11.4565 4.6771 2.449 0.0144 *
## Gestazione 33.4900 3.8616 8.673 < 2e-16 ***
## Lunghezza 10.2260 0.3012 33.956 < 2e-16 ***
## Cranio 10.5150 0.4271 24.621 < 2e-16 ***
## SessoM 78.6576 11.2256 7.007 3.12e-12 ***
## FumatriciSmoker 785.9930 757.7101 1.037 0.2997
## Gestazione:FumatriciSmoker -20.7952 19.2877 -1.078 0.2811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2489 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7273, Adjusted R-squared: 0.7265
## F-statistic: 830 on 8 and 2489 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int2 10 35214.13
The results show that neither the main effect of
FumatriciSmoker (p = 0.2997) nor the interaction term
Gestazione x FumatriciSmoker (p = 0.2811) is statistically
significant.
Consistent with this, the model’s adjusted R-squared (0.7265) and
residual standard error (274.7) remain virtually unchanged relative to the
preferred baseline model (mod5), so the
interaction provides no improvement in explanatory
power.
The data therefore give no evidence that weight gain across gestational weeks differs between newborns of smokers and non-smokers.
Additionally, the BIC increases from 35200.87 for mod5 to
35214.13 for mod_int2.
Given the absence of statistical support and the lack of contribution
to model performance, the interaction
Gestazione x Fumatrici will not be retained.
The third interaction, explored in
mod_int3, investigates whether the effect
of gestational age (Gestazione) on neonatal weight varies
with maternal age (Anni.madre). This evaluates a
clinically plausible hypothesis: older and younger mothers may
experience different fetal growth patterns across gestational weeks.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Sesso + Anni.madre:Gestazione, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1145.57 -182.19 -14.16 162.57 2633.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5046.8110 626.4363 -8.056 1.21e-15 ***
## Anni.madre -56.2111 20.9597 -2.682 0.00737 **
## N.gravidanze 11.1063 4.6649 2.381 0.01735 *
## Gestazione -9.7323 16.0102 -0.608 0.54332
## Lunghezza 10.2116 0.3007 33.957 < 2e-16 ***
## Cranio 10.5231 0.4265 24.672 < 2e-16 ***
## SessoM 78.2130 11.1986 6.984 3.66e-12 ***
## Anni.madre:Gestazione 1.4718 0.5394 2.728 0.00641 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.4 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7279, Adjusted R-squared: 0.7271
## F-statistic: 951.6 on 7 and 2490 DF, p-value: < 2.2e-16
Whether the effect of gestational age on neonatal weight varies with
maternal age was tested by including the interaction term
Anni.madre x Gestazione. Although the coefficient is
statistically significant (p = 0.00641), the effect size is small and
the adjusted R² improves only marginally (0.7264 → 0.7271). Moreover,
the main-effect coefficients change sharply once the term is included:
Gestazione becomes negative and non-significant, and Anni.madre
becomes negative and significant. In a model with an interaction,
these coefficients describe the effect of each variable when the other
equals zero, a value outside any plausible range, so they can no
longer be read as overall effects and the model becomes harder to
interpret.
Model comparison using BIC confirms that the interaction does not improve the trade-off between fit and complexity (BIC increases from 35200.87 to 35201.23).
## df BIC
## mod5 8 35200.87
## mod_int3 9 35201.23
Importantly, while maternal age and gestational age are individually recognized as predictors of birth weight, there is no robust clinical evidence that their interaction exerts a meaningful effect.
Overall, although the interaction term reaches statistical
significance, it does not materially enhance model performance and
introduces additional complexity without improving predictive or
interpretative value. For these reasons, the interaction between
Anni.madre and Gestazione is not retained for
the final model selection.
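To read the mod_int3 coefficients correctly, the slope of Gestazione has to be evaluated at a specific maternal age. The sketch below does this with the printed estimates; the helper gest_slope and the three ages are illustrative choices, not values taken from the analysis.

```r
# Coefficients read from the mod_int3 summary above
b_gest     <- -9.7323  # Gestazione (slope when Anni.madre = 0)
b_gest_age <- 1.4718   # Anni.madre:Gestazione

# Hypothetical helper: grams gained per extra gestational week
# for a mother of the given age
gest_slope <- function(age) b_gest + b_gest_age * age

# Illustrative maternal ages (chosen for the example, not from the report)
ages <- c(20, 30, 40)
print(data.frame(age = ages, slope = gest_slope(ages)))
```

At these ages the implied slope is positive, roughly 20 to 49 g per week. The negative Gestazione coefficient on its own describes a mother of age zero, so it does not indicate that longer gestation lowers weight.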
The interaction between N.gravidanze and
Gestazione is now explored in the model
mod_int4.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Sesso + N.gravidanze:Gestazione, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1159.02 -181.98 -14.32 163.76 2632.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6619.5765 163.7454 -40.426 < 2e-16 ***
## Anni.madre 0.8826 1.1491 0.768 0.443
## N.gravidanze -67.1692 70.1506 -0.958 0.338
## Gestazione 30.2786 4.3885 6.900 6.59e-12 ***
## Lunghezza 10.2458 0.3008 34.059 < 2e-16 ***
## Cranio 10.5302 0.4271 24.656 < 2e-16 ***
## SessoM 77.8221 11.2121 6.941 4.95e-12 ***
## N.gravidanze:Gestazione 2.0182 1.8044 1.119 0.263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7265
## F-statistic: 948.3 on 7 and 2490 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int4 9 35207.44
The coefficient for the interaction term is not statistically
significant (p = 0.263), and the adjusted R-squared remains essentially
unchanged compared to the baseline model mod5.
Additionally, the BIC increases from 35200.87 (mod5) to
35207.44 (mod_int4), indicating that including this
interaction does not improve model parsimony or explanatory power.
Consequently, there is no evidence that the effect of gestational age on
newborn’s weight depends on the number of previous pregnancies, and the
interaction term is not retained in the final model.
The interaction between Cranio and
Lunghezza is explored in the model
mod_int5.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Sesso + Cranio:Lunghezza, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1161.16 -180.93 -12.18 165.61 2860.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.833e+03 1.019e+03 -1.799 0.0721 .
## Anni.madre 8.940e-01 1.144e+00 0.782 0.4346
## N.gravidanze 1.159e+01 4.651e+00 2.492 0.0128 *
## Gestazione 3.846e+01 3.988e+00 9.645 < 2e-16 ***
## Lunghezza -3.063e-01 2.203e+00 -0.139 0.8894
## Cranio -4.773e+00 3.192e+00 -1.495 0.1350
## SessoM 7.316e+01 1.121e+01 6.529 8.01e-11 ***
## Lunghezza:Cranio 3.158e-02 6.531e-03 4.835 1.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7296, Adjusted R-squared: 0.7289
## F-statistic: 959.9 on 7 and 2490 DF, p-value: < 2.2e-16
The inclusion of the Cranio x Lunghezza interaction term
improves the regression model, both statistically and
substantively. Head circumference and body length are each
correlated with neonatal weight, and both
capture aspects of overall fetal size. The interaction term lets the
effect of one measure depend on the other, so it reflects body
proportionality more effectively than
treating the two measures separately.
Statistically, the interaction is highly significant (p < 0.001), the residual standard error decreases (from 274.7 to 273.5), and the adjusted R² increases modestly (from 0.7264 to 0.7289), indicating improved fit.
BIC-based model comparison and ANOVA testing are also explored to
further investigate the significance of the interaction
Cranio x Lunghezza.
## df BIC
## mod5 8 35200.87
## mod_int5 9 35185.35
## Analysis of Variance Table
##
## Model 1: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso + Cranio:Lunghezza
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2491 187996688
## 2 2490 186248307 1 1748380 23.375 1.415e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model comparison criteria confirm this improvement: BIC decreases
from 35200.87 (mod5) to 35185.35 (mod_int5),
and ANOVA shows that the reduction in unexplained variance is unlikely
due to chance (p ≈ 1.4 × 10⁻⁶). Therefore, the interaction between
newborn length and head circumference provides meaningful additional
explanatory power in predicting birth weight.
Clinically, these results are plausible: a proportionally large head
and long body jointly represent overall fetal growth and maturity more
powerfully than either measure alone. For these reasons, the
Cranio x Lunghezza interaction is a candidate for retention in
the final model as a meaningful predictor of neonatal weight.
We investigate potential non-linear effects for three variables:
Gestazione, Lunghezza, and
Cranio. Visual inspection of correlation patterns in
section 6.1 suggests that these predictors may not follow a strictly
linear trend. Starting from the baseline linear model
(mod5), we extend the model by adding a quadratic term
for each variable individually. This approach allows us to assess
whether introducing curvature significantly improves model fit, and to
isolate which predictor, if any, benefits from a non-linear
specification.
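A quadratic term can be supplied to lm() in two equivalent ways. The sketch below uses simulated data (not the study data) to show both; the column name Lunghezza2 follows the naming visible in the model outputs, and the assumption that it was created as a precomputed squared column is ours.

```r
set.seed(1)

# Simulated toy data (NOT the study data) with a curved length-weight relation
n <- 200
toy <- data.frame(Lunghezza = runif(n, min = 400, max = 550))
toy$Peso <- 0.03 * (toy$Lunghezza - 320)^2 + 2000 + rnorm(n, sd = 200)

# Option 1: precompute the squared column (naming assumed from the outputs)
toy$Lunghezza2 <- toy$Lunghezza^2
fit_col <- lm(Peso ~ Lunghezza + Lunghezza2, data = toy)

# Option 2: square inside the formula with I()
fit_I <- lm(Peso ~ Lunghezza + I(Lunghezza^2), data = toy)

# Check that both specifications give the same fitted values
print(all.equal(unname(fitted(fit_col)), unname(fitted(fit_I))))
```

The I() form keeps the transformation inside the formula, which avoids having to recreate the squared column when predicting on new data.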
Non-linear effects for variable Gestazione are explored
in mod_gestazione_nl.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Gestazione2 + Lunghezza + Cranio + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1155.4 -180.9 -12.6 165.0 2656.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4655.9158 898.7477 -5.180 2.39e-07 ***
## Anni.madre 0.9756 1.1487 0.849 0.3958
## N.gravidanze 11.0879 4.6669 2.376 0.0176 *
## Gestazione -82.2342 49.7570 -1.653 0.0985 .
## Gestazione2 1.5347 0.6625 2.317 0.0206 *
## Lunghezza 10.3522 0.3040 34.048 < 2e-16 ***
## Cranio 10.6202 0.4287 24.772 < 2e-16 ***
## SessoM 75.6414 11.2450 6.727 2.15e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.5 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7277, Adjusted R-squared: 0.7269
## F-statistic: 950.5 on 7 and 2490 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int5 9 35185.35
## mod_gestazione_nl 9 35203.31
Introducing a quadratic term for Gestazione results in a
statistically significant coefficient for Gestazione2
(p = 0.0206), indicating a mild non-linear pattern. However,
this effect is weak compared to the main predictors in the
model (Lunghezza, Cranio, and Sesso), whose p-values are many orders
of magnitude smaller. More importantly, the added complexity is not
justified by the criterion: the BIC increases from 35200.87
(mod5) to 35203.31 (mod_gestazione_nl).
Overall, even though
a small curvature can be detected, it does not meaningfully enhance
explanatory or predictive performance, so Gestazione can be
kept as a linear term in the final model.
Non-linear effects for variable Lunghezza are explored
in mod_lunghezza_nl.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Lunghezza2 + Cranio + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1178.63 -181.69 -12.53 163.16 1782.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.684839 725.416864 0.249 0.80332
## Anni.madre 0.768213 1.128384 0.681 0.49606
## N.gravidanze 12.932944 4.590205 2.818 0.00488 **
## Gestazione 42.809484 3.895536 10.989 < 2e-16 ***
## Lunghezza -20.242336 3.163266 -6.399 1.86e-10 ***
## Lunghezza2 0.031630 0.003267 9.681 < 2e-16 ***
## Cranio 10.637035 0.419499 25.357 < 2e-16 ***
## SessoM 69.905286 11.040380 6.332 2.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.7 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7362
## F-statistic: 996.7 on 7 and 2490 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int5 9 35185.35
## mod_lunghezza_nl 9 35116.40
Adding a quadratic term for Lunghezza produces a highly
significant coefficient for Lunghezza2 (p <
2e-16), indicating a strong non-linear component in the relationship
between newborn length and weight. Both the linear
(Lunghezza) and quadratic (Lunghezza2) terms
are significant, confirming the presence of curvature and justifying
their joint inclusion in the model.
Importantly, model fit improves substantially: the BIC decreases from
35200.87 (mod5) to 35116.40
(mod_lunghezza_nl), the largest improvement among the
models tested so far. This shows that the added non-linear term captures
meaningful structure in the data and enhances the model’s explanatory
and predictive performance. Therefore, a non-linear specification for
Lunghezza might be justified in the final model.
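With a quadratic term, the effect of one extra millimetre of length is no longer constant: it equals the derivative of the fitted parabola at the given length. The sketch below evaluates it from the printed estimates; the helper length_slope is hypothetical, and 490 mm is an illustrative value taken from the dataset preview.

```r
# Coefficients read from the mod_lunghezza_nl summary above
b1 <- -20.242336  # Lunghezza
b2 <- 0.031630    # Lunghezza2

# Hypothetical helper: grams per extra mm of length, at a given length (mm)
length_slope <- function(len) b1 + 2 * b2 * len

# Length at which the fitted parabola reaches its minimum
vertex <- -b1 / (2 * b2)

# 490 mm is an illustrative value taken from the dataset preview
print(c(vertex_mm = vertex, slope_at_490 = length_slope(490)))
```

The vertex falls near 320 mm, so for lengths above that point the fitted curve is increasing and steepens as length grows. At 490 mm the implied slope is about 10.8 g per mm, close to the linear coefficient of mod5, which explains why the negative sign on Lunghezza is not a concern.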
Non-linear effects for variable Cranio are explored in
mod_cranio_nl.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Cranio + Cranio2 + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1148.51 -181.10 -14.44 164.85 2617.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.42349 1152.83339 0.043 0.9658
## Anni.madre 0.84341 1.14140 0.739 0.4600
## N.gravidanze 11.49937 4.63992 2.478 0.0133 *
## Gestazione 39.19182 3.95232 9.916 < 2e-16 ***
## Lunghezza 10.48787 0.30160 34.774 < 2e-16 ***
## Cranio -31.77255 7.17044 -4.431 9.78e-06 ***
## Cranio2 0.06257 0.01059 5.909 3.91e-09 ***
## SessoM 73.03006 11.16735 6.540 7.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 272.9 on 2490 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.7301
## F-statistic: 965.9 on 7 and 2490 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int5 9 35185.35
## mod_cranio_nl 9 35173.91
Including a quadratic term for Cranio reveals a clear
non-linear pattern: both the linear (Cranio) and quadratic
(Cranio2) terms are highly significant (p =
9.78e-06 and p = 3.91e-09, respectively). This indicates that
the relationship between head circumference and newborn weight is curved
rather than strictly linear.
Compared to the baseline model (mod5), the BIC decreases
from 35200.87 to 35173.91 in the non-linear model
(mod_cranio_nl), confirming that the quadratic term
improves model fit. Although the improvement is smaller than the one
observed for mod_lunghezza_nl, it is still substantial
enough to possibly justify retaining a non-linear specification for
Cranio in the final model.
Now, we extend the analysis by exploring a model that includes both
non-linear effects for Lunghezza and Cranio,
called mod_nl_combined.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Lunghezza + Lunghezza2 + Cranio + Cranio2 + Sesso, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1179.52 -181.55 -11.82 162.97 1767.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.228e+01 1.140e+03 -0.011 0.99141
## Anni.madre 7.682e-01 1.129e+00 0.681 0.49615
## N.gravidanze 1.295e+01 4.592e+00 2.820 0.00484 **
## Gestazione 4.269e+01 3.935e+00 10.849 < 2e-16 ***
## Lunghezza -2.082e+01 4.122e+00 -5.052 4.69e-07 ***
## Lunghezza2 3.222e-02 4.230e-03 7.616 3.68e-14 ***
## Cranio 1.265e+01 9.181e+00 1.378 0.16836
## Cranio2 -2.975e-03 1.355e-02 -0.219 0.82629
## SessoM 6.999e+01 1.105e+01 6.334 2.82e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.8 on 2489 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.737, Adjusted R-squared: 0.7361
## F-statistic: 871.8 on 8 and 2489 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int5 9 35185.35
## mod_lunghezza_nl 9 35116.40
## mod_cranio_nl 9 35173.91
## mod_nl_combined 10 35124.18
In this model, Lunghezza2 remains highly significant
(p = 3.68e-14), indicating a strong non-linear effect of
newborn length on weight, and the linear term for Lunghezza
is also significant. By contrast, neither Cranio nor
Cranio2 is individually significant. This does not mean that head
circumference is irrelevant: Cranio is highly significant in
mod_lunghezza_nl (p < 2e-16), and the loss of significance here
reflects the overlap between Cranio and Cranio2 when both are
included. What the result does suggest is that the quadratic term for
Cranio adds nothing once the curvature in length is accounted for.
Other predictors retain their expected effects:
gestational age and sex remain strong positive predictors, and number of
previous pregnancies has a modest positive impact.
Model comparison using BIC confirms that including only the quadratic
effect of Lunghezza (mod_lunghezza_nl) yields
the best fit so far (BIC = 35116.40), outperforming the model with the
linear interaction (mod_int5, 35185.35), the model with a quadratic
term for Cranio alone (mod_cranio_nl, 35173.91), and the combined
model (mod_nl_combined, 35124.18). This indicates that the non-linear relationship
between body length and weight is the dominant curvature in the data,
and that an additional quadratic term for Cranio is not warranted.
Next, we extend the analysis by exploring a model that includes both
the non-linear effect of Lunghezza and the interaction
between length and head circumference. This extended model is
called mod_int5_lunghezza_nl.
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione +
## Sesso + Lunghezza + Lunghezza2 + Cranio + Cranio:Lunghezza +
## Cranio:Lunghezza2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1184.97 -180.57 -12.25 166.10 1308.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.169e+03 6.335e+03 -0.342 0.73209
## Anni.madre 7.154e-01 1.128e+00 0.634 0.52600
## N.gravidanze 1.316e+01 4.583e+00 2.872 0.00411 **
## Gestazione 4.070e+01 3.935e+00 10.344 < 2e-16 ***
## SessoM 7.177e+01 1.103e+01 6.507 9.27e-11 ***
## Lunghezza -2.188e+01 2.911e+01 -0.752 0.45226
## Lunghezza2 4.531e-02 3.320e-02 1.364 0.17254
## Cranio 2.699e+01 1.941e+01 1.390 0.16463
## Lunghezza:Cranio -3.275e-02 8.734e-02 -0.375 0.70772
## Lunghezza2:Cranio -1.848e-06 9.819e-05 -0.019 0.98499
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.2 on 2488 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7383, Adjusted R-squared: 0.7374
## F-statistic: 779.9 on 9 and 2488 DF, p-value: < 2.2e-16
## df BIC
## mod5 8 35200.87
## mod_int5 9 35185.35
## mod_lunghezza_nl 9 35116.40
## mod_int5_lunghezza_nl 11 35119.46
The extended model (mod_int5_lunghezza_nl) includes the
quadratic effect of Lunghezza as well as interactions
between Cranio and both Lunghezza and
Lunghezza2, aiming to capture potential non-linear
relationships and body proportionality effects. In this model, none of
the length and head terms is individually significant: the quadratic term
Lunghezza2 (p = 0.173), the
interactions with Cranio (Lunghezza:Cranio,
p = 0.708; Lunghezza2:Cranio, p = 0.985),
and Cranio itself (p = 0.165) all fail to reach
conventional significance levels, as does the linear Lunghezza term
(p = 0.452). Meanwhile, the other
predictors (Gestazione, Sesso,
N.gravidanze) retain their expected positive effects.
Model comparison using BIC shows that adding these interaction terms
does not meaningfully improve fit: the BIC of
mod_int5_lunghezza_nl (35119.46) is slightly higher than
the model including only the quadratic effect of Lunghezza
(mod_lunghezza_nl, 35116.40) and lower than simpler linear
models (mod5 = 35200.87, mod_int5 =
35185.35).
These results indicate that the non-linear effect of
Lunghezza alone captures the dominant curvature in the
data, while interactions with Cranio do not
contribute additional explanatory power. For parsimony and
interpretability, the model including only Lunghezza2 as a
non-linear term is preferred.
An ANOVA comparing the linear model (mod5) with the
model including the quadratic effect of Lunghezza
(mod_lunghezza_nl) is carried out.
## Analysis of Variance Table
##
## Model 1: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Gestazione + Lunghezza + Lunghezza2 +
## Cranio + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2491 187996688
## 2 2490 181177870 1 6818817 93.714 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show a highly significant reduction in the residual
sum of squares (F = 93.71, p < 2.2 × 10⁻¹⁶).
This indicates that adding the quadratic term for Lunghezza
significantly improves model fit.
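The F statistic in the table can be reproduced by hand from the two residual sums of squares. The short sketch below uses only the values printed in the ANOVA output; the variable names are illustrative.

```r
# Values copied from the ANOVA table above
rss_linear    <- 187996688  # RSS of mod5
rss_quadratic <- 181177870  # RSS of mod_lunghezza_nl
df_resid      <- 2490       # residual df of the larger model
df_added      <- 1          # one extra parameter (Lunghezza2)

# F = (drop in RSS per added parameter) / (residual mean square of larger model)
f_stat <- ((rss_linear - rss_quadratic) / df_added) / (rss_quadratic / df_resid)
p_val  <- pf(f_stat, df_added, df_resid, lower.tail = FALSE)
round(f_stat, 2)
```

The result should agree with the reported F = 93.71 up to rounding of the printed sums of squares.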
Based on the systematic exploration of all candidate models—including
linear models, models with interactions, and models with non-linear
terms—the model mod_lunghezza_nl is currently the preferred
regression model. This model includes the predictors
Anni.madre, N.gravidanze,
Gestazione, Lunghezza,
Lunghezza2, Cranio, and Sesso,
with the quadratic term for Lunghezza capturing the
non-linear relationship between body length and newborn weight.
The inclusion of the quadratic term significantly improves model fit,
as supported by both the ANOVA test and the reduction in BIC compared to
simpler linear models. Because it adds only one parameter to mod5,
mod_lunghezza_nl is the best-performing model among those tested while remaining parsimonious.
Before confirming it as the final model, we will also evaluate its predictive accuracy using RMSE and assess multicollinearity to ensure stability and interpretability.
RMSE (Root Mean Squared Error) measures the average magnitude of the prediction errors in the same units as the outcome variable. Lower RMSE indicates better predictive accuracy.
RMSE for mod5:
## [1] 274.3335
RMSE for mod_int5:
## [1] 273.0549
RMSE for mod_lunghezza_nl:
## [1] 269.3124
For the candidate models, mod5 has an RMSE of 274.33 g.
Including the interaction term in mod_int5 slightly reduces
the RMSE to 273.05 g, indicating a modest improvement. The model
mod_lunghezza_nl, which incorporates the quadratic term for
Lunghezza, achieves the lowest RMSE of 269.31 g.
This shows that accounting for the non-linear effect of newborn length improves prediction more than including the linear interaction term.
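As a minimal sketch, the RMSE of a fitted `lm` object can be computed with a one-line helper (the name `rmse` is an illustrative choice, not necessarily the function used in this analysis). The reported values can also be cross-checked against the residual sums of squares in the ANOVA table, assuming n = 2498 observations (2491 residual degrees of freedom plus 7 estimated coefficients for mod5).

```r
# RMSE of a fitted model: square root of the mean squared residual
rmse <- function(model) sqrt(mean(residuals(model)^2))

# Cross-check from the ANOVA table: RMSE = sqrt(RSS / n)
n <- 2491 + 7                        # assumption: residual df + coefficients of mod5
rmse_mod5 <- sqrt(187996688 / n)     # RSS of mod5
rmse_nl   <- sqrt(181177870 / n)     # RSS of mod_lunghezza_nl
round(c(mod5 = rmse_mod5, mod_lunghezza_nl = rmse_nl), 2)
```

Both values should match the RMSE figures reported above (274.33 g and 269.31 g).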
Variance Inflation Factors (VIF) were computed to evaluate
multicollinearity in the model. Most predictors, including
Anni.madre, N.gravidanze,
Gestazione, Cranio, and Sesso,
have very low VIF values (close to 1–2), indicating minimal correlation
among them. As expected, Lunghezza and its quadratic term
Lunghezza2 exhibit very high VIFs (> 200) due to the
inherent correlation between a variable and its square.
This high VIF for Lunghezza and Lunghezza2
is a normal feature of polynomial models and does not compromise the
overall stability or predictive accuracy of the model. The other
predictors remain interpretable, and the quadratic term for
Lunghezza captures the non-linear effect on newborn weight
without introducing harmful multicollinearity.
## [1] 1.189330 1.186426 1.819041 238.038904 230.062510 1.630127 1.046128
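The VIF pattern described above can be illustrated on simulated data. The sketch below computes VIFs by hand as 1 / (1 - R²) and shows that centering the variable before squaring it removes most of the inflation. The simulated lengths and the helper `vif_manual` are assumptions made for this example; centering is shown only as an illustration and is not part of the analysis in this report.

```r
set.seed(2)
n <- 1000
len   <- rnorm(n, mean = 495, sd = 25)   # simulated lengths in mm (not real data)
other <- rnorm(n)                        # an unrelated predictor

# VIF of each column: 1 / (1 - R^2) from regressing it on the other columns
vif_manual <- function(X) {
  out <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
  setNames(out, colnames(X))
}

X_raw      <- cbind(len = len, len2 = len^2, other = other)
len_c      <- len - mean(len)            # centered length
X_centered <- cbind(len = len_c, len2 = len_c^2, other = other)

vif_raw      <- vif_manual(X_raw)        # very large for len and len2
vif_centered <- vif_manual(X_centered)   # close to 1 for all columns
```

Centering changes the interpretation of the linear coefficient but not the fitted values, which is consistent with the view that the high VIFs are a by-product of the polynomial parameterization.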
For these reasons, mod_lunghezza_nl is selected as the
optimal regression model, balancing predictive performance,
interpretability, and biological plausibility.
The quadratic term for Lunghezza in
mod_lunghezza_nl appears to capture the non-linear
relationship between body length and birth weight more effectively than
the linear interaction Cranio × Lunghezza. Biologically,
this may reflect that as a newborn’s length increases, other body
dimensions—such as chest, abdomen, and head circumference—also tend to
grow, causing weight to increase faster than a simple linear trend.
Including Lunghezza2 allows the model to better reflect
this curvature, which is consistent with the lower RMSE and improved fit
observed in both ANOVA and BIC comparisons.
After selecting the final regression model
(mod_lunghezza_nl), it is essential to assess the model
assumptions by examining the residuals. Residuals—the differences
between observed and predicted values—provide insight into how well the
model fits the data and whether the key assumptions of linear regression
are satisfied.
In particular, the assumptions of linear regression are:
Linearity – The relationship between each predictor and the response is approximately linear.
Normality – Residuals should be normally distributed.
Homoscedasticity – Residuals should have constant variance across the entire range of predicted values.
Independence – Residuals should not be correlated with each other or with the predictor variables.
Additionally, it is important to check for extreme residuals that may disproportionately influence the regression estimates.
Below, these assumptions are explored graphically before being formally assessed through statistical tests.
A scatter plot of residuals versus fitted values shows no systematic patterns. Residuals are randomly scattered around zero, indicating that the linear model appropriately captures the trend.
The Q-Q (quantile-quantile) plot of residuals shows points roughly following the diagonal line. Minor deviations occur at the extremes, but overall, residuals appear approximately normally distributed.
The Scale-Location plot (square root of the absolute standardized residuals versus fitted values) makes trends in the spread of the residuals easier to see. Here, the roughly horizontal band of points indicates homoscedasticity, while funnel- or cone-shaped patterns would suggest heteroscedasticity.
The Residuals vs. Leverage plot shows that point 1551 exceeds Cook’s distance thresholds, suggesting that it may represent a highly influential observation affecting the model disproportionately.
To complement the graphical evaluation, the assumptions of the final
regression model mod_lunghezza_nl are also assessed using
formal statistical tests:
##
## Shapiro-Wilk normality test
##
## data: residuals(mod_lunghezza_nl)
## W = 0.98573, p-value = 3.888e-15
The very low p-value indicates a statistically significant departure from normality, so the hypothesis that the residuals are normally distributed must be rejected. However, this is common with large sample sizes [1]. The earlier graphical inspection (Q-Q plot) suggests that the deviation is mostly due to extreme residuals, while the majority of residuals appear approximately normally distributed.
Moreover, the density curve of the residuals is fairly symmetric and centered around zero, although it shows a slightly longer right tail.
##
## studentized Breusch-Pagan test
##
## data: mod_lunghezza_nl
## BP = 129.06, df = 7, p-value < 2.2e-16
The significant p-value suggests the presence of heteroscedasticity, indicating that residual variance may increase for certain ranges of predicted values. The hypothesis of homoscedasticity must be rejected.
##
## Durbin-Watson test
##
## data: mod_lunghezza_nl
## DW = 1.9468, p-value = 0.09152
## alternative hypothesis: true autocorrelation is greater than 0
The non-significant p-value indicates no evidence of positive autocorrelation, so residuals can be considered independent.
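The three tests above can be sketched in base R on simulated data. The data, the model `fit` and all variable names are illustrative assumptions. The Breusch-Pagan and Durbin-Watson statistics are computed by hand here; the studentized Breusch-Pagan statistic is taken as n × R² from the auxiliary regression of the squared residuals on the predictors, which is understood to be what `lmtest::bptest` reports by default.

```r
set.seed(3)
n <- 800
x1 <- runif(n, 1, 10)
z  <- rnorm(n)
# Error standard deviation grows with x1, so the data are heteroscedastic
y <- 2 + 1.5 * x1 - z + rnorm(n, sd = 0.5 * x1)
fit <- lm(y ~ x1 + z)
e <- residuals(fit)

# Normality of residuals (base R)
sw <- shapiro.test(e)

# Studentized (Koenker) Breusch-Pagan statistic:
# n * R^2 from regressing the squared residuals on the predictors
aux     <- lm(e^2 ~ x1 + z)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1          # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw_stat <- sum(diff(e)^2) / sum(e^2)
```

In this simulated example the Breusch-Pagan test should reject homoscedasticity, while the Durbin-Watson statistic should stay close to 2 because the observations are generated independently.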
Overall, although mod_lunghezza_nl shows some deviations
from normality and homoscedasticity, it satisfies the independence
assumption and captures the main predictors effectively. Given its
parsimony and interpretability, it remains the most suitable model for
this dataset among those explored.
Leverage measures how far an observation’s predictor values are from the mean of all predictor values. Observations with high leverage have the potential to exert strong influence on the regression line, even if their residuals are small.
To assess leverage in mod_int5, we computed the hat
values for all observations and plotted them. A horizontal threshold
line at 2p/n was added, where p is the number of
predictors and n is the sample size.
A total of 133 observations lie above this threshold, indicating they have higher leverage than typical points. These observations warrant attention, as they could potentially influence the regression coefficients more strongly than the majority of the data.
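The leverage check can be sketched as follows on simulated data. The model `fit` and the data are illustrative, and `p` is taken here as the number of estimated coefficients including the intercept, which is an assumption: conventions for `p` in the 2p/n rule differ.

```r
set.seed(4)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + 0.5 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

lev <- hatvalues(fit)            # diagonal of the hat matrix
p   <- length(coef(fit))         # assumption: p counts the intercept as well
threshold <- 2 * p / n

high_lev <- which(lev > threshold)   # indices of high-leverage observations
length(high_lev)
```

The hat values always sum to the number of estimated coefficients, so 2p/n is simply twice the average leverage. A plot like the one described can be drawn with `plot(lev, type = "h")` followed by `abline(h = threshold, lty = 2)`.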
Outliers are observations with unusually large residuals, which can disproportionately affect the regression estimates. To identify potential outliers, the studentized residuals are explored, which standardize residuals to account for their differing variances.
A scatter plot of the studentized residuals was created, with
horizontal reference lines at -2 and 2 to highlight observations
with unusually large residuals. Additionally, a formal
Bonferroni-adjusted outlier test was performed using
the car::outlierTest function.
## rstudent unadjusted p-value Bonferroni p
## 1551 11.163329 2.8581e-28 7.1395e-25
## 155 5.116418 3.3514e-07 8.3719e-04
## 1306 4.791893 1.7493e-06 4.3698e-03
The analysis identified three highly significant outliers:
Observation 1551 (r_student = 11.16, Bonferroni p < 1e-24)
Observation 155 (r_student = 5.12, Bonferroni p = 8.37e-04)
Observation 1306 (r_student = 4.79, Bonferroni p = 4.37e-03)
These points exhibit exceptionally large residuals and should be carefully considered. While they may represent data errors, they could also reflect genuine extreme cases in the population.
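The Bonferroni-adjusted p-values in the output are the unadjusted p-values multiplied by the number of observations tested. The sketch below reproduces them from the printed values, assuming n = 2498 (inferred from the residual degrees of freedom, not stated explicitly in the output).

```r
# Unadjusted p-values copied from the outlier test output above
p_unadj <- c(obs_1551 = 2.8581e-28, obs_155 = 3.3514e-07, obs_1306 = 1.7493e-06)
n_obs   <- 2498   # assumption: inferred from the residual degrees of freedom

# Bonferroni adjustment: multiply by the number of tests, cap at 1
p_bonf <- pmin(p_unadj * n_obs, 1)
signif(p_bonf, 4)
```

All three adjusted p-values remain far below 0.05, so the three observations stay significant even after correcting for testing every residual.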
Cook’s distance is a measure of the influence of each observation on the fitted regression coefficients. It takes into account both the leverage of the observation (its distance in the predictor space) and the size of its residual (difference between observed and predicted values), capturing the overall influence on the regression estimates. Observations with large Cook’s distance values can disproportionately affect the model and may warrant further investigation.
For mod_lunghezza_nl, Cook’s distances were computed for
all observations, and a threshold of 0.5 was considered to identify
potentially influential points. The maximum Cook’s distance is 1.267,
corresponding to Observation 1551, which exceeds the threshold,
indicating that at least one observation has substantial influence on
the model.
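A minimal sketch of this check on simulated data is given below. One observation is deliberately planted to be both high-leverage and poorly fitted; the data and the model `fit` are illustrative assumptions, while the 0.5 cut-off is the one used in the text.

```r
set.seed(5)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
# Plant one observation that is both high-leverage and badly fitted
x[n] <- 6
y[n] <- -10
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
threshold   <- 0.5                      # same cut-off as in the text
influential <- which(cd > threshold)    # observations above the cut-off
max_idx     <- which.max(cd)            # most influential observation
round(cd[max_idx], 3)
```

In this example only the planted observation exceeds the threshold, mirroring the situation in which a single point dominates the Cook's distance plot.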
The plot below displays Cook’s distances for all observations, with a horizontal dashed line at 0.5 marking the threshold:
Once the final regression model (mod_lunghezza_nl) is
selected, it can be used to estimate the expected neonatal weight
(Peso) for hypothetical or real cases based on specific
values of the predictors. Predictions are computed by inputting the
chosen values for each predictor into the model equation, taking into
account both the linear and the quadratic term for Lunghezza (the model contains no interaction terms).
Three illustrative examples were considered:
CASE 1) Female newborn, typical maternal and gestational characteristics
Anni.madre = 28, N.gravidanze = 3,
Gestazione = 39, Lunghezza = 495,
Cranio = 340, Sesso = “F”
Predicted weight: 3257 g
# Case 1: female newborn, typical maternal and gestational characteristics
Anni.madre.predict <- 28
N.gravidanze.predict <- 3
Gestazione.predict <- 39
Lunghezza.predict <- 495
Cranio.predict <- 340
Sesso.predict <- "F"
predict_df <- data.frame(
  Anni.madre = Anni.madre.predict,
  N.gravidanze = N.gravidanze.predict,
  Gestazione = Gestazione.predict,
  Lunghezza = Lunghezza.predict,
  Lunghezza2 = Lunghezza.predict^2,  # squared term expected by the model
  Cranio = Cranio.predict,
  Sesso = Sesso.predict
)
peso.predict <- round(predict(mod_lunghezza_nl, newdata = predict_df), 0)
## 1
## 3257
CASE 2) Male newborn with same maternal and gestational characteristics
Sesso = “M” (other predictors identical to the first
case)
Predicted weight: 3327 g
The 70 g difference relative to the first case reflects the positive effect of male sex
(SessoM) on neonatal weight in the model.
# Case 2: same inputs as case 1, but male newborn
Anni.madre.predict <- 28
N.gravidanze.predict <- 3
Gestazione.predict <- 39
Lunghezza.predict <- 495
Cranio.predict <- 340
Sesso.predict <- "M"
predict_df <- data.frame(
  Anni.madre = Anni.madre.predict,
  N.gravidanze = N.gravidanze.predict,
  Gestazione = Gestazione.predict,
  Lunghezza = Lunghezza.predict,
  Lunghezza2 = Lunghezza.predict^2,  # squared term expected by the model
  Cranio = Cranio.predict,
  Sesso = Sesso.predict
)
peso.predict <- round(predict(mod_lunghezza_nl, newdata = predict_df), 0)
## 1
## 3327
CASE 3) Female newborn with higher gestational age
Anni.madre = 28, N.gravidanze = 3,
Gestazione = 41, Lunghezza = 495,
Cranio = 340, Sesso = “F”
Predicted weight: 3343 g
Here, two additional weeks of gestation raise the predicted weight by 86 g compared to the first example.
# Case 3: same inputs as case 1, but 41 weeks of gestation
Anni.madre.predict <- 28
N.gravidanze.predict <- 3
Gestazione.predict <- 41
Lunghezza.predict <- 495
Cranio.predict <- 340
Sesso.predict <- "F"
predict_df <- data.frame(
  Anni.madre = Anni.madre.predict,
  N.gravidanze = N.gravidanze.predict,
  Gestazione = Gestazione.predict,
  Lunghezza = Lunghezza.predict,
  Lunghezza2 = Lunghezza.predict^2,  # squared term expected by the model
  Cranio = Cranio.predict,
  Sesso = Sesso.predict
)
peso.predict <- round(predict(mod_lunghezza_nl, newdata = predict_df), 0)
## 1
## 3343
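The three scenarios can also be scored in a single call by placing one case per row in the prediction data frame. The sketch below is a self-contained illustration on simulated data: the data frame `sim`, its coefficients and the reduced model `fit` are assumptions and do not reproduce the estimates of mod_lunghezza_nl. It also swaps the precomputed Lunghezza2 column for `I(Lunghezza^2)` in the formula, so that `predict()` builds the squared term itself.

```r
# Simulated stand-in for the study data (values are NOT the real dataset)
set.seed(6)
n <- 1000
sim <- data.frame(
  Gestazione = round(rnorm(n, 39, 1.5)),
  Lunghezza  = round(rnorm(n, 495, 25)),
  Sesso      = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$Peso <- -4500 + 40 * sim$Gestazione + 12.5 * sim$Lunghezza +
  70 * (sim$Sesso == "M") + rnorm(n, sd = 200)

# I(Lunghezza^2) lets predict() compute the squared term automatically
fit <- lm(Peso ~ Gestazione + Lunghezza + I(Lunghezza^2) + Sesso, data = sim)

# All scenarios in one data frame, one row per case
cases <- data.frame(
  Gestazione = c(39, 39, 41),
  Lunghezza  = c(495, 495, 495),
  Sesso      = factor(c("F", "M", "F"), levels = levels(sim$Sesso))
)
cases$Peso_pred <- round(predict(fit, newdata = cases))
cases
```

Keeping the cases in one table makes the comparisons between scenarios explicit and avoids repeating the same block of assignments for each case.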
Visualizations play a crucial role in complementing the statistical analysis performed in the previous sections. In this chapter, a series of plots is presented to illustrate how gestational time, anthropometric measures, and sex relate to newborn weight, and to visually confirm several of the trends identified through modelling.
In the plots below, the horizontal lines represent weight median and IQR values to support visual interpretation of the results.
The first visualization explores the relationship between gestational
age (Gestazione) and newborn weight (Peso),
stratified by sex (Sesso). The scatter plot displays all
individual observations, while linear smoothing curves highlight the
trend separately for males and females and an overall trend for the
entire sample.
Comments:
Weight increases with gestational age, showing the expected positive developmental trend.
Male newborns tend to weigh more than females at every gestational age, with two visibly distinct regression lines.
All newborns delivered at 33 weeks or earlier fall below the first quartile (Q1) of the weight distribution, indicating systematically lower weight due to prematurity.
Newborns delivered between 37 and 41 weeks—the typical full-term interval—show weights concentrated around the median, reflecting normal intrauterine growth.
Outliers are present, including full-term newborns (41–42 weeks) whose weight remains below Q1, suggesting atypically low growth despite prolonged gestation.
This second visualization examines whether maternal smoking
(Fumatrici) is associated with differences in newborn
weight across gestational ages.
Comments:
The number of observations is strongly unbalanced: smokers are far fewer than non-smokers, and their data do not span the full range of gestational ages.
As a consequence, the regression line for smokers is shorter and less reliable, while the overall regression line is driven mainly by the numerous non-smoker observations, nearly overlapping with the non-smoker trend.
Within the limited gestational range where smoker data are available, the slope of the smoker regression line is smaller, suggesting that among smokers, newborn weight increases more slowly with gestational age.
However, the linear model explored earlier in section 6 showed
that the variable Fumatrici was not statistically significant, indicating that these
graphical differences should be interpreted with caution and may not
reflect a meaningful effect once other variables are accounted
for.
This plot explores the relationship between newborn length and weight, illustrating how body size at birth relates to overall mass.
Comments:
Weight increases with newborn length, showing a clear positive relationship.
The scatter plot appears approximately linear over the bulk of the data, although the significant quadratic term for Lunghezza in the final model indicates some curvature.
Regression lines for males and females are generally close. At lower lengths, the female line is slightly above the male line, but the two cross at a length of around 420 mm because the male line has a steeper slope.
Most observations are concentrated around 450–540 mm in length, representing the bulk of the data.
This plot explores the relationship between newborn head circumference and weight, illustrating how head size at birth relates to overall mass.
Comments:
Weight increases with newborn head circumference, indicating a strong positive association.
Most observations fall below the regression lines at low head circumference values.
Scatter is densest between 320–360 mm, where most weights are concentrated around the IQR.
Regression lines for males and females are clearly separated, with males consistently showing higher weights.
[1] https://spsssolutions.com/shapiro-wilk-test-guide/#When_Should_You_Use_the_Shapiro-Wilk_Test
[2] https://www.ospedalebambinogesu.it/da-0-a-30-giorni-come-si-presenta-e-come-cresce-80012/
[3] https://www.medicalnewstoday.com/articles/324728
This analysis was conducted using R version 4.5.1.