1 Background

Birth weight is one of the most important indicators of neonatal health, as it is strongly associated with several short- and long-term medical outcomes. Low birth weight can increase the risk of respiratory complications, metabolic disorders and developmental delays during the first stages of life.

The company Neonatal Health Solutions aims to support hospitals through data analysis and predictive modelling in order to better understand the factors that influence neonatal outcomes.

In this project we analyse a dataset containing information on 2500 newborns collected from three different hospitals. The goal is to identify the main factors affecting newborn birth weight and build a statistical model capable of predicting it using maternal characteristics, pregnancy information and neonatal measurements.

The analysis combines exploratory data analysis, hypothesis testing and multiple linear regression in order to provide interpretable insights that may help healthcare professionals improve prenatal monitoring and hospital resource planning.

2 Setting up the environment

#Define Coefficient of Variation function
CV <-function(x){
  return(sd(x)/mean(x) * 100)
}

# Define variability function
variability <- function(x) {
  rng <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  variance <- round(var(x, na.rm = TRUE), 2)
  stdev <- round(sd(x, na.rm = TRUE), 2)
  coeff_var <- round(CV(x), 2)

  return(data.frame(
    range = rng,
    IQR = iqr,
    variance = variance,
    standard_deviation = stdev,
    coefficient_of_variation = coeff_var
  ))
}

#Define shape function
shape <- function(x) {
  #Calculate skewness e kurtosis
  skew <- round(skewness(x, na.rm = TRUE), 2)
  kurt <- round(kurtosis(x, na.rm = TRUE) - 3, 2)

  #Return values
  return(data.frame(
    skewness = skew,
    kurtosis = kurt
  ))
}

#Model fit statistics
model_fit_table <- function(model, caption = "Model fit statistics") {
  
  fit <- data.frame(
    R_squared     = summary(model)$r.squared,
    Adj_R_squared = summary(model)$adj.r.squared,
    Residual_SE   = summary(model)$sigma,
    F_statistic   = summary(model)$fstatistic[1],
    AIC           = AIC(model),
    BIC           = BIC(model)
  )
  
  fit <- round(fit, 3)
  
  knitr::kable(
    fit,
    caption = caption
  )
}

3 Variable Identification and Description

newborn <- read.csv("neonati.csv", stringsAsFactors= T, sep=",")
attach(newborn)
nrow(newborn)
## [1] 2500
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
26 0 0 42 3380 490 325 Nat osp3 M
21 2 0 39 3150 490 345 Nat osp1 F
34 3 0 38 3640 500 375 Nat osp2 M
28 1 0 41 3690 515 365 Nat osp2 M
20 0 0 38 3700 480 335 Nat osp3 F

The dataset neonati includes ten variables collected from three different hospitals, encompassing both qualitative and quantitative data, with a total of 2,500 observations. A clear understanding of the nature and type of each variable is essential for selecting appropriate statistical techniques and building a reliable predictive model for birth weight, which is the primary target variable in this study.

Below is a description and classification of each variable:

  • Mother’s Age (Anni.madre): A quantitative continuous variable representing the mother’s age in years.
  • Number of Pregnancies (N.gravidanze): A quantitative discrete variable indicating the total number of pregnancies experienced by the mother.
  • Maternal Smoking (Fumatrici): A binary qualitative nominal variable

    indicating whether the mother is a smoker (1) or non-smoker (0).
  • Gestation Duration (Gestazione): A quantitative discrete variable indicating the number of weeks of gestation.
  • Newborn Weight (Peso): A quantitative continuous variable, measured in grams, representing the birth weight. This is the dependent (target) variable for our predictive model.
  • Length (Lunghezza): A quantitative continuous variable indicating the newborn’s length in millimeter.
  • Head Circumference (Cranio): A quantitative continuous variable representing the newborn’s head circumference.
  • Type of Delivery (Tipo.parto): A qualitative nominal variable indicating the type of delivery: Vaginal (Nat) or Cesarean (Ces).
  • Birth Hospital (Ospedale): A qualitative nominal variable identifying the hospital where the delivery took place. No ordinal relationship is implied between hospitals.
  • Newborn Sex (Sesso): A qualitative nominal variable representing the newborn’s sex: Male (M) or Female (F).

4 Analysis and Modeling

4.1 Preliminary Analysis

4.1.1 Variable Overview

knitr::kable(summary(newborn))
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1256
1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1772 osp2:849 M:1244
Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00 Median :3300 Median :500.0 Median :340 NA osp3:835 NA
Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

Here below are some considerations regarding the initial analysis of the variables:

  • Mother’s Age: The maximum value for the mother’s age is 46, which is reasonable considering the onset of menopause. However, the minimum value of 0 clearly indicates a case of dirty data, likely caused by incorrect data entry or missing value encoding. These invalid observations will need to be removed to avoid bias in the analysis.
  • Number of Pregnancies: A minimum value of 0 indicates first-time pregnancies, which is expected. The mean value is approximately 1, suggesting most women in the dataset have had one or no previous pregnancies. The maximum of 12 is unusual and likely represents an outlier but not necessarily an error.
  • Maternal Smoking: The median value is 0, confirming that the majority of mothers do not smoke. The mean of ~0.04 indicates that around 4% of mothers are smokers, making it a relevant but minor subgroup for analysis.
  • Gestation Duration: The duration ranges from 25 to 43 weeks. Values below 30 likely reflect premature births or possible data inconsistencies. A median and mean of 39 weeks suggests a normally distributed gestational period, centered around full-term pregnancies.
  • Newborn Weight: The birth weight ranges from 830g to 4930g. Observations below 1000g may indicate preterm or high-risk newborns. Further analysis may be required to understand the clinical context of these low values.
  • Length: Newborn length varies from 310mm to 565mm. While no immediate red flags appear, values near the lower bound could be cross-checked with gestational age for consistency.
  • Head Circumference: The average and median are both 340mm, with a range from 235mm to 390mm. This indicates symmetry in the distribution and a moderate degree of variability among newborns.
  • Type of Delivery: The dataset contains more natural deliveries (1772) than cesarean sections (728), suggesting either a natural preference or reporting bias. Delivery type could be relevant for subgroup analysis, especially for high-risk pregnancies.
  • Birth Hospital: The distribution across hospitals is relatively even: osp2 has the most records (849), followed by osp3 (835), and osp1 (816). These differences are minor and the dataset appears balanced in this respect.
  • Newborn Sex: The sex distribution is fairly even, with a slight majority of females (1256 vs 1244 males). This balance is expected and supports unbiased model development across sex.

4.1.2 Data Cleaning

At first glance, the minimum value of the mother’s age variable warrants further investigation. A recorded minimum of 0 years is clearly invalid, as no biological or clinical scenario justifies such a value. This anomaly is most likely the result of a data entry error or a missing value incorrectly encoded as 0. To visually inspect potential outliers a scatter plot will be generated.

#Create the scatter plot
ggplot(newborn, aes(x = 1:nrow(newborn), y = Anni.madre)) +
  # Conditionally color points red if they are outliers
  geom_point(aes(color = Anni.madre < (quantile(Anni.madre, 0.25) - 1.5 * 
                                         IQR(Anni.madre))),
             size = 1) +

  # Label only the outlier points
  geom_text(aes(label = ifelse(Anni.madre < (quantile(Anni.madre, 0.25) - 1.5 * 
                                               IQR(Anni.madre)),
                               Anni.madre, "")),
            vjust = 1.5, show.legend = FALSE, color = "red", size = 3) +

  # Manual color mapping: TRUE = red, FALSE = black
  scale_color_manual(values = c("black", "red")) +

  # Styling
  labs(title = "Mother's Age Values",
       x = "Observation ID",
       y = "Mother's Age") +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

The scatter plot highlights the presence of five outliers in the variable representing mother’s age. Among them, three values (13 and 14 years old) can be considered extremely rare but biologically plausible, as early teenage pregnancies, although uncommon, are possible. However, the values 0 and 1 are clearly implausible and most likely result from data entry or recording errors.

newborn_clean <- newborn %>%
  filter(!(Anni.madre %in% c(0, 1)))
attach(newborn_clean)
## The following objects are masked from newborn:
## 
##     Anni.madre, Cranio, Fumatrici, Gestazione, Lunghezza, N.gravidanze,
##     Ospedale, Peso, Sesso, Tipo.parto
knitr::kable(summary(newborn_clean))
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. :13.00 Min. : 0.0000 Min. :0.00000 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1255
1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1770 osp2:848 M:1243
Median :28.00 Median : 1.0000 Median :0.00000 Median :39.00 Median :3300 Median :500.0 Median :340 NA osp3:834 NA
Mean :28.19 Mean : 0.9816 Mean :0.04163 Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.00000 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 Max. :1.00000 Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

Reviewing the summary of the cleaned dataset, it is evident that all variables, except for Mother’s Age, maintain trends consistent with those seen in the original data. The cleaning process has not significantly altered the distributions of variables such as gestation, weight, or newborn sex.
Regarding the Mother’s Age variable, the updated summary shows a minimum value of 13 and a maximum of 46, both of which are biologically plausible. The median remains at 28, and the mean is now 28.19, closely aligned with the median. This suggests a symmetrical and approximately normal distribution of maternal age in the cleaned data. The removal of invalid entries (such as ages 0 and 1) has improved the variable’s reliability and accuracy for statistical modeling.

### Descriptive Statistics

To further confirm the findings highlighted by the measures of position, an analysis of variability is essential. This allows us to better understand the variables.

#Apply the variability() function to each quantitative variable
variability_table <- do.call(rbind, lapply(newborn_clean[, quant_var], variability))

#Display the result as a nicely formatted table
knitr::kable(variability_table, digits = 2)
range IQR variance standard_deviation coefficient_of_variation
Anni.madre 33 7 27.22 5.22 18.51
N.gravidanze 12 1 1.64 1.28 130.50
Peso 4100 630 275865.90 525.23 15.99
Gestazione 18 2 3.49 1.87 4.79
Lunghezza 255 30 693.21 26.33 5.32
Cranio 155 20 269.93 16.43 4.83
  • Mother’s Age: Measures of variability indicate a moderate spread in maternal age. The range of 33 years and an IQR of 7 years suggest that most mothers are concentrated within a relatively narrow age band.The standard deviation of approximately 5.2 years and a coefficient of variation of about 18.5% indicatemoderate heterogeneity, consistent with a typical reproductive-age population.
  • Number of Pregnancies: This variable shows high relative variability. Although the absolute variance is small, the range of 12 and an IQR of 1 reveal that most observations are clustered at low values, with a few extreme cases.The very high coefficient of variation (≈130%) highlights strong dispersion relative to the mean, indicating a highly heterogeneous distribution.
  • Gestational Duration: Variability in gestational duration is relatively low.The range of 18 weeks and an IQR of 2 weeks show that most pregnancies are tightly clustered around term.A small standard deviation (≈1.9 weeks) and a low coefficient of variation (~4.8%) confirm limited dispersion, with deviations mainly driven by preterm cases.
  • Newborn weight: Birthweight exhibits substantial absolute variability, with a wide range of over 4000 g and an IQR of 630 g. The standard deviation of approximately 525 g reflects meaningful biological variation among newborns.The coefficient of variation (~16%) indicates moderate relative dispersion, which is expected in a mixed population of term and preterm infants.
  • Length: Length shows moderate absolute variability, with a wide range and an IQR of 30 mm. The standard deviation of about 26 mm indicates noticeable but controlled dispersion.The low coefficient of variation (~5.3%) suggests that, relative to its mean, length is a stable anthropometric measure.
  • Head Circumference: Cranial diameter displays low variability compared to other measures. The range of 155 mm and an IQR of 20 mm indicate a relatively compact distribution. A standard deviation of approximately 16.4 mm and a low coefficient of variation (~4.8%) confirm high measurement consistency across newborns.

To enhance the finding a graphical representation and an in-deep analysis of skewness and kurtosis and the boxplot is essential.

#Apply the shape() function to each quantitative variable
shape_table <- do.call(rbind, lapply(newborn_clean[, quant_var], shape))

#Display the result as a nicely formatted table
knitr::kable(shape_table, digits = 2)
skewness kurtosis
Anni.madre 0.15 -0.11
N.gravidanze 2.51 10.98
Peso -0.65 2.03
Gestazione -2.07 8.26
Lunghezza -1.51 6.48
Cranio -0.79 2.94
  • Mother’s Age: The distribution of maternal age is approximately symmetric, as indicated by a skewness value close to zero (0.15). The slightly negative kurtosis (-0.11) suggests a distribution very close to normal, with no evidence of heavy tails or extreme outliers. Overall, mother’s age shows a well-balanced shape.
  • Number of Pregnancies: This variable exhibits strong positive skewness (2.51), indicating that most mothers have few pregnancies, while a small number have many. The very high kurtosis (10.98) reflects a highly peaked distribution with heavy tails, driven by extreme but plausible values.These shape characteristics confirm a pronounced departure from normality.
  • Newborn weight: Newborn weight shows moderate negative skewness (-0.65), suggesting a longer lower tail associated with low-birthweight and preterm infants.A kurtosis value of 2.03 indicates heavier tails than a normal distribution, reflecting clinically meaningful variability rather than data errors.
  • Gestational Duration: Gestational duration is strongly negatively skewed (-2.07), with a sharp lower tail caused by preterm births.The high kurtosis (8.26) indicates a very peaked distribution with heavy tails.This shape suggests that gestational duration deviates substantially from normality.
  • Length: Length exhibits negative skewness (-1.51), indicating a longer lower tail linked to growth restriction and prematurity.The elevated kurtosis (6.48) points to a leptokurtic distribution with substantial tail weight.These features highlight the presence of extreme but clinically relevant observations.
  • Head Circumference: Head Circumference shows mild negative skewness (-0.79), suggesting slight asymmetry toward lower values.A kurtosis of 2.94 indicates moderately heavy tails, but less extreme than those observed for gestational age or number of pregnancies.
#Reshape the dataset to long format
long_data <- newborn_clean %>%
  pivot_longer(cols = all_of(quant_var), names_to = "Variable", values_to = "Value")

#Compute median and upper outlier threshold for each variable
lines_df <- long_data %>%
  group_by(Variable) %>%
  summarise(
    Q2 = quantile(Value, 0.5, na.rm = TRUE),
    LowerOut = quantile(Value, 0.25, na.rm = TRUE) - 1.5 * IQR(Value, na.rm = 
                                                                 TRUE),
    UpperOut = quantile(Value, 0.75, na.rm = TRUE) + 1.5 * IQR(Value, na.rm =                                                                 TRUE)
  )

#Create the boxplots
ggplot(long_data, aes(x = "", y = Value)) +
  geom_boxplot(fill = "lightblue", outlier.shape = 16, outlier.color = "red") + 
  geom_hline(data = lines_df, aes(yintercept = Q2), color = "red") +            
  geom_hline(data = lines_df, aes(yintercept = UpperOut), color = "black") +    
  geom_hline(data = lines_df, aes(yintercept = LowerOut), color = "grey40") +   
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "Boxplots of Quantitative Variables with Outliers and Thresholds",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )            


Overall, the boxplots visually confirm the distributional properties identified through skewness and kurtosis. Variables such as number of pregnancies and gestational duration show pronounced asymmetry and a high concentration of extreme values,which is reflected in the presence of numerous outliers and elongated whiskers. Similarly, the heavier lower tails observed in the boxplots for newborn weight, length, and Head Circumference are consistent with their negative skewness and elevated kurtosis values. In contrast, mother’s age displays a more balanced boxplot, in line with skewness and kurtosis values close to those of a normal distribution. These visual patterns reinforce the quantitative evidence and support the conclusions drawn from the measures of distribution shape.

Before focusing on the inferential analysis and the formal testing of the research hypotheses, it is useful to conclude the descriptive phase by graphically representing the main variables.Visual exploration provides a more intuitive and comprehensive overview of the data,allowing patterns, differences, and potential anomalies to be identified more easily than through numerical summaries alone. These graphical representations complement the descriptive statistics and help contextualize the subsequent hypothesis tests.

# Prepare the variable

qual_df <- newborn_clean %>%
  mutate(
    Fumatrici  = factor(Fumatrici, levels = c(0, 1), labels = c("No smoking", "Smoking")),
    Tipo.parto = factor(Tipo.parto, levels = c("Nat", "Ces"), labels = c("Vaginal", "Cesarean")),
    Ospedale   = factor(Ospedale),
    Sesso      = factor(Sesso, levels = c("F", "M"), labels = c("Female", "Male"))
  )

freq_cat <- qual_df %>%
  select(all_of(qual_var)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
  count(Variable, Category, name = "n") %>%
  group_by(Variable) %>%
  mutate(p = n / sum(n)) %>%
  ungroup() %>%
  mutate(
    Variable = dplyr::recode(
      Variable,
      Fumatrici  = "Maternal smoking",
      Tipo.parto = "Delivery type",
      Ospedale   = "Birth hospital",
      Sesso      = "Newborn sex"
    ),
    label = paste0(n, " (", percent(p, accuracy = 0.1), ")")
  )

# Create the graph
p_cat <- ggplot(freq_cat, aes(x = n, y = Category, fill = Category)) +
  geom_col(width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = label), hjust = -0.08, size = 3.6) +
  facet_wrap(~ Variable, scales = "free_y", ncol = 2) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  scale_fill_brewer(palette = "Set3") +  
  labs(
    title = "Qualitative variables — frequency distributions",
    x = "Number of observations",
    y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10),
    strip.text = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

p_cat

# Function to build frequency tables
freq_table <- function(data, var) {
  data %>%
    count({{ var }}, name = "Count") %>%
    mutate(
      Percentage = Count / sum(Count),
      Percentage = percent(Percentage, accuracy = 0.1)
    )
}

tab_hospital  <- freq_table(qual_df, Ospedale)
tab_delivery  <- freq_table(qual_df, Tipo.parto)
tab_smoking   <- freq_table(qual_df, Fumatrici)
tab_sex       <- freq_table(qual_df, Sesso)

kable(tab_hospital)
Ospedale Count Percentage
osp1 816 32.7%
osp2 848 33.9%
osp3 834 33.4%
kable(tab_delivery)
Tipo.parto Count Percentage
Vaginal 1770 70.9%
Cesarean 728 29.1%
kable(tab_smoking)
Fumatrici Count Percentage
No smoking 2394 95.8%
Smoking 104 4.2%
kable(tab_sex)
Sesso Count Percentage
Female 1255 50.2%
Male 1243 49.8%



Overall, the qualitative variables show a clear and interpretable structure: the sample is well balanced across the three hospitals (each contributing roughly one third of the observations), delivery mode is predominantly vaginal with a substantial minority of cesarean sections, and newborn sex is almost perfectly balanced. Maternal smoking is relatively rare in this cohort.



pretty_labels <- c(
  "Mother's age (years)",
  "Number of pregnancies",
  "Gestation (weeks)",
  "Birthweight (g)",
  "Length (cm)",
  "Head circumference"
)

num_long <- newborn_clean %>%
  select(all_of(quant_var)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  mutate(
    Variable = factor(Variable, levels = quant_var, labels = pretty_labels)
  )

# Only Mother’s age qualifies based on your skewness/kurtosis
normal_vars <- c("Mother's age (years)")

curve_df <- num_long %>%
  filter(Variable %in% normal_vars) %>%
  group_by(Variable) %>%
  summarise(
    mu = mean(Value, na.rm = TRUE),
    sigma = sd(Value, na.rm = TRUE),
    xmin = quantile(Value, 0.01, na.rm = TRUE),
    xmax = quantile(Value, 0.99, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  rowwise() %>%
  mutate(x = list(seq(xmin, xmax, length.out = 200))) %>%
  unnest(cols = c(x)) %>%
  mutate(y = dnorm(x, mean = mu, sd = sigma)) %>%
  ungroup()

p_num_scientific <- ggplot(num_long, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 35, alpha = 0.85) +
  geom_line(
    data = curve_df,
    aes(x = x, y = y),
    linewidth = 1
  ) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(
    title = "Numeric variables — distributions (post-cleaning)",
    subtitle = "Normal curve added only where skewness and kurtosis indicate approximate normality",
    x = NULL,
    y = "Density"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10),
    strip.text = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

p_num_scientific


Normal density curves were superimposed only for variables that satisfied approximate normality conditions, defined by |skewness| < 0.5 and kurtosis close to 0. Based on these criteria, only maternal age exhibited a distribution sufficiently close to normal, while all other variables showed significant asymmetry and/or heavy tails. Importantly, the visual patterns observed in the histograms are fully consistent with the previously computed numerical shape measures (skewness and kurtosis) as well as with the information provided by the boxplots. In particular, variables characterized by strong skewness and heavy tails in the numerical analysis also display asymmetric shapes and numerous outliers in the boxplots, whereas variables closer to normality show more symmetric and compact distributions. This coherence between graphical and numerical summaries reinforces the robustness of the descriptive analysis and supports the interpretation of the underlying distributional properties of the data.

Having completed this descriptive phase, we now proceed to the inferential analysis, where the following research hypotheses will be formally tested using appropriate statistical methods: (i) whether the proportion of cesarean deliveries differs across hospitals, (ii) whether the mean birthweight and mean length of the newborns in this sample are significantly equal to the corresponding population values, and (iii) whether anthropometric measurements differ significantly between male and female newborns.

4.2 Hypothesis Testing

Cesarean delivery is performed more often in some hospital

To investigate whether the frequency of cesarean deliveries differs across hospitals, considering also that both variables are categorical, a Chi-square test of independence will be applied to the contingency table formed by hospital and delivery type. This test assesses whether the observed distribution of vaginal and cesarean births deviates from what would be expected under the assumption of independence. A significant result would indicate that the likelihood of undergoing a cesarean section varies by hospital, suggesting differences in clinical practices or patient populations.

This can be formulated as:
  • Null hypothesis (H₀): The proportion of cesarean deliveries is the same across hospitals (delivery type is independent of hospital).
  • Alternative hypothesis (H₁): The proportion of cesarean deliveries differs across hospitals (delivery type is associated with hospital).
# Contingency table: Hospital × Delivery type
tab_delivery_hospital <- table(qual_df$Ospedale, qual_df$Tipo.parto)

tab_delivery_hospital_df <- as.data.frame(tab_delivery_hospital) %>%
  rename(Hospital = Var1, DeliveryType = Var2, n = Freq) %>%
  group_by(Hospital) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()


kable(tab_delivery_hospital)
Vaginal Cesarean
osp1 574 242
osp2 594 254
osp3 602 232
# Chi-square test

chi <- chisq.test(tab_delivery_hospital)
knitr::kable(tidy(chi))
statistic p.value parameter method
1.082984 0.5818793 2 Pearson’s Chi-squared test
# Check assumptions: expected frequencies

chi_expected <- chisq.test(tab_delivery_hospital)$expected
knitr::kable(round(chi_expected, 1))
Vaginal Cesarean
osp1 578.2 237.8
osp2 600.9 247.1
osp3 590.9 243.1
# Grahpical rapresentation 

p1 <- ggplot(tab_delivery_hospital_df, aes(x = Hospital, y = prop, fill = DeliveryType)) +
  geom_col(position = "fill", width = 0.7) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Delivery type distribution by hospital",
    x = "Hospital",
    y = "Proportion",
    fill = "Delivery type"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    panel.grid.minor = element_blank()
  )

p1


A Pearson Chi-square test of independence was used to assess whether the proportion of cesarean deliveries differs across hospitals. The test was not statistically significant (χ2 = 1.083, df = 2, p = 0.5819), indicating no evidence that delivery type is associated with the hospital in this sample. The expected frequencies under the null hypothesis were all large (minimum expected count ≈ 237.8), satisfying the assumptions required for the Chi-square approximation. Therefore, based on these data we conclude that the observed differences in cesarean rates across the three hospitals are compatible with random variation rather than systematic differences in clinical practice.


The average anthropometric characteristics observed in this sample of newborns—specifically birthweight and length—are representative of those observed in the general population


The second hypothesis assesses whether the average anthropometric characteristics observed in this sample of newborns—specifically Newborn weight and length—are statistically consistent with the corresponding population reference values. This step is important to evaluate the external validity of the dataset, i.e., whether the sample can be considered representative of the broader population and therefore suitable for generalizable clinical interpretation and predictive modeling. Since the comparison is between a sample mean to a known population value, the appropriate test is a one-sample t-test. Before performing the test, we will check:

  1. Independence of observations (guaranteed by study design).
  2. Approximate normality of the variable (using histograms, skewness, kurtosis, and boxplots).
  3. Sample size, since large samples make the t-test robust to deviations from normality.
mu_weight <- 3300
mu_length <- 500

# One-sample t-test: Birthweight
t_weight <- t.test(newborn_clean$Peso, mu = mu_weight)

# One-sample t-test: Length
t_length <- t.test(newborn_clean$Lunghezza, mu = mu_length)

# Create a summary table
results <- data.frame(
  Variable = c("Birthweight", "Length"),
  Mean = c(mean(newborn_clean$Peso),
           mean(newborn_clean$Lunghezza)),
  t_value = c(t_weight$statistic, t_length$statistic),
  df = c(t_weight$parameter, t_length$parameter),
  p_value = c(t_weight$p.value, t_length$p.value)
)

# Round ONLY numeric columns
results[, -1] <- round(results[, -1], 3)

knitr::kable(results,
             caption = "t-tests comparing sample means to population reference values")
(#tab:one_sample_t_tests)t-tests comparing sample means to population reference values
Variable Mean t_value df p_value
Birthweight 3284.184 -1.505 2497 0.132
Length 494.696 -10.069 2497 0.000


A one-sample t-test was conducted to assess whether the mean birthweight and length observed in this sample of newborns differ from their respective population reference values. For birthweight, the sample mean (3284.2 g) is slightly lower than the reference value of 3300 g; however, this difference is not statistically significant (t = −1.505, p = 0.132). Therefore, there is no evidence to reject the null hypothesis, and the sample mean birthweight can be considered statistically consistent with the population reference value.
In contrast, the results for length show a statistically significant difference from the population reference value (t = −10.069, p = 0). The observed sample mean (494.7 units) is substantially lower than the reference value, indicating that newborn length in this sample differs systematically from the expected population average. Given the large sample size, this result is both statistically robust and unlikely to be due to random variation alone.
Overall, while birthweight appears representative of population standards, newborn length does not. This suggests that, although the dataset is broadly representative in terms of weight, caution is warranted when interpreting or generalizing results related to length, as the sample may reflect structural differences (e.g., measurement scale, clinical practices, or population characteristics) relative to the chosen reference values.

Anthropometric measurements differ significantly between the two sexes.
The third hypothesis investigates whether anthropometric measurements differ significantly between male and female newborns. From a statistical perspective, this hypothesis involves comparing the distribution of continuous anthropometric variables—such as birthweight, length, and head circumference—between two independent groups defined by newborn sex. The analysis will therefore focus on evaluating whether the mean values of these measurements differ between males and females.
Depending on the distributional properties of the variables and the robustness of the results observed during the descriptive analysis, appropriate two-sample comparison tests will be employed. In particular, independent-samples t-tests will be used when assumptions of approximate normality and homoscedasticity are reasonably satisfied.

# Data preparation
sex_df <- newborn_clean %>%
  mutate(Sesso_raw = trimws(toupper(as.character(Sesso)))) %>%
  mutate(
    Sesso = case_when(
      Sesso_raw %in% c("F", "FEMALE", "FEMMINA") ~ "Female",
      Sesso_raw %in% c("M", "MALE", "MASCHIO")   ~ "Male",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(Sesso)) %>%
  mutate(Sesso = factor(Sesso, levels = c("Female", "Male")))


# Variables to compare
anthro_vars <- c("Peso", "Lunghezza", "Cranio")

# Boxplot
sex_long <- sex_df %>%
  select(Sesso, all_of(anthro_vars)) %>%
  pivot_longer(cols = all_of(anthro_vars),
               names_to = "Variable",
               values_to = "Value") %>%
  mutate(
    Variable = dplyr::recode(
      Variable,
      Peso = "Birthweight",
      Lunghezza = "Length",
      Cranio = "Head circumference"
    )
  )

p_box <- ggplot(sex_long, aes(x = Sesso, y = Value, fill = Sesso)) +
  geom_boxplot(width = 0.55, outlier.alpha = 0.25) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(
    title = "Anthropometric measures by sex",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),  
    strip.text = element_text(face = "bold"),
    legend.position = "none",
    panel.grid.minor = element_blank()
  )

p_box

# t-test 
t_peso <- t.test(Peso ~ Sesso, data = sex_df)
t_lung <- t.test(Lunghezza ~ Sesso, data = sex_df)
t_cran <- t.test(Cranio ~ Sesso, data = sex_df)

results <- data.frame(
  Variable = c("Birthweight", "Length", "Head circumference"),
  Female_mean = c(
    mean(sex_df$Peso[sex_df$Sesso == "Female"], na.rm = TRUE),
    mean(sex_df$Lunghezza[sex_df$Sesso == "Female"], na.rm = TRUE),
    mean(sex_df$Cranio[sex_df$Sesso == "Female"], na.rm = TRUE)
  ),
  Male_mean = c(
    mean(sex_df$Peso[sex_df$Sesso == "Male"], na.rm = TRUE),
    mean(sex_df$Lunghezza[sex_df$Sesso == "Male"], na.rm = TRUE),
    mean(sex_df$Cranio[sex_df$Sesso == "Male"], na.rm = TRUE)
  ),
  t_value = c(unname(t_peso$statistic), unname(t_lung$statistic), unname(t_cran$statistic)),
  df = c(unname(t_peso$parameter), unname(t_lung$parameter), unname(t_cran$parameter)),
  p_value = c(t_peso$p.value, t_lung$p.value, t_cran$p.value)
)

# Round ONLY numeric columns
results[, -1] <- round(results[, -1], 3)

knitr::kable(
  results,
  caption = "Two-sample t-tests comparing anthropometric measures between female and male newborns"
)
(#tab:anthropometric_hypothesis_test)Two-sample t-tests comparing anthropometric measures between female and male newborns
Variable Female_mean Male_mean t_value df p_value
Birthweight 3161.061 3408.496 -12.115 2488.674 0
Length 489.764 499.675 -9.582 2457.301 0
Head circumference 337.623 342.459 -7.437 2489.389 0


The results of the two-sample t-tests indicate that all three anthropometric measures —birthweight, length, and head circumference— differ significantly between female and male newborns. In each case, the p-value is effectively zero, providing very strong statistical evidence against the null hypothesis of equal means between sexes.
Specifically, male newborns exhibit higher average values for all considered measures. The mean birthweight of male newborns (≈ 3408 g) is substantially higher than that of females (≈ 3161 g). Similarly, males show larger average head circumference (≈ 342 mm vs ≈ 338 mm) and greater average length (≈ 499 mm vs ≈ 490 mm). The negative t-values observed in all tests reflect the fact that the mean for females is consistently lower than that for males.
The accompanying boxplots visually reinforce these findings. For each anthropometric variable, the distribution for male newborns is clearly shifted upward relative to that for females, with higher medians and upper quartiles. While there is some overlap between the distributions—expected in biological data—the overall separation between sexes is evident and consistent across all three measures.
Importantly, the graphical evidence is fully coherent with the numerical results of the t-tests. The systematic differences in central tendency observed in the boxplots align with the statistically significant test outcomes, supporting the conclusion that sex is a meaningful determinant of neonatal anthropometric characteristics in this dataset. As a result, newborn sex should be considered a relevant explanatory variable in subsequent analyses and predictive modeling.

4.3 Regression Model


The next step of the analysis focuses on the development of a multiple linear regression model aimed at explaining and predicting newborn birthweight. Birthweight is treated as a continuous response variable and represents a key indicator of neonatal health, making it a suitable outcome for a regression-based modeling approach. By fitting a multiple linear regression model, it will be possible to quantify the marginal effect of each predictor on birthweight while holding all other variables constant. This approach allows us to disentangle the individual contribution of correlated factors and to identify the most influential determinants of neonatal weight.

As a first step in the regression workflow, a correlation matrix is computed for the quantitative variables included in the analysis. This provides an initial overview of the linear relationships between predictors and the response variable, and it is particularly useful to identify clusters of strongly correlated predictors (e.g., anthropometric measures) that may lead to multicollinearity in a multiple regression model. In practice, high correlations suggest that some predictors may carry overlapping information, potentially inflating standard errors and reducing the stability and interpretability of coefficient estimates. For this reason, the correlation structure is used to guide subsequent modeling choices, such as prioritizing clinically meaningful variables, evaluating multicollinearity, and motivating potential model refinements during the selection phase.

#select numerica variables for correlation
corr_df <- newborn_clean %>%
  select(all_of(quant_var)) %>%
  mutate(across(everything(), as.numeric))

# Correlation matrix
corr_mat <- cor(corr_df, use = "complete.obs", method = "pearson")

# Correlation matrix as a report table 
kable(round(corr_mat, 3), caption = "Correlation matrix (Pearson) for quantitative variables")
(#tab:correlation_matrix)Correlation matrix (Pearson) for quantitative variables
Anni.madre N.gravidanze Peso Gestazione Lunghezza Cranio
Anni.madre 1.000 0.383 -0.024 -0.135 -0.065 0.016
N.gravidanze 0.383 1.000 0.002 -0.102 -0.060 0.039
Peso -0.024 0.002 1.000 0.592 0.796 0.705
Gestazione -0.135 -0.102 0.592 1.000 0.619 0.461
Lunghezza -0.065 -0.060 0.796 0.619 1.000 0.603
Cranio 0.016 0.039 0.705 0.461 0.603 1.000
# Custom correlation panel
panel.cor <- function(x, y, digits = 2, prefix = "", ...) {
  par(usr = c(0, 1, 0, 1))
  r <-cor(x, y)
  txt <- formatC(r, digits = digits, format = "f")

  # Base size + gentle scaling
  cex_base  <- 0.9
  cex_scale <- 1.2

  text(0.5, 0.5, txt, cex = cex_base + cex_scale * r)
}


# Select quantitative variable
corr_df <- newborn_clean %>%
  select(all_of(quant_var)) %>%
  mutate(across(everything(), as.numeric))

# Scatterplot matrix with correlations

pairs(
  corr_df,
  upper.panel = panel.cor,
  lower.panel = panel.smooth,
  main = "Scatterplot matrix and pairwise correlations (quantitative variables)"
)


The scatterplot matrix provides a comprehensive overview of the pairwise relationships among the quantitative variables included in the regression analysis. Several clear linear associations emerge, particularly between birthweight and the anthropometric measures available before birth. Birthweight shows a strong positive correlation with newborn length (r ≈ 0.80) and head circumference (r ≈ 0.70), indicating that larger fetal dimensions are closely associated with higher weight at birth. A moderate to strong positive relationship is also observed between gestational duration and birthweight (r ≈ 0.59), which is consistent with clinical expectations: longer gestation allows for greater fetal growth. Gestational duration is additionally correlated with length (r ≈ 0.62) and, to a lesser extent, with head circumference (r ≈ 0.46), suggesting that gestation affects multiple dimensions of neonatal growth. Maternal characteristics such as mother’s age and number of pregnancies exhibit relatively weak correlations with birthweight and the other anthropometric variables. This indicates that, while these factors may still play a role in a multivariable context, their direct linear association with neonatal size is limited when considered in isolation. Overall, the correlation structure highlights the presence of strong interrelationships among anthropometric predictors, particularly length and head circumference, which may introduce multicollinearity in the regression model.

Based on the insights obtained from the descriptive analysis and the correlation structure of the quantitative variables, a multiple linear regression model is specified with newborn birthweight as the response variable. The initial formulation of the model includes all predictors that are clinically plausible or potentially associated with birthweight, allowing for a comprehensive assessment of their joint effects within a multivariable framework. This full specification serves as a reference model from which more parsimonious alternatives can be derived through subsequent model refinement and selection procedures.

mod1 <- lm(
  Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
         Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso,
  data = newborn_clean
)


kable(
  round(summary(mod1)$coefficients, 3),
  caption = "Full multiple linear regression model for newborn birthweight"
)
(#tab:full_model_formula)Full multiple linear regression model for newborn birthweight
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6735.796 141.479 -47.610 0.000
Anni.madre 0.802 1.147 0.699 0.484
N.gravidanze 11.381 4.669 2.438 0.015
Fumatrici -30.274 27.549 -1.099 0.272
Gestazione 32.577 3.821 8.526 0.000
Lunghezza 10.292 0.301 34.207 0.000
Cranio 10.472 0.426 24.567 0.000
Tipo.partoNat 29.634 12.091 2.451 0.014
Ospedaleosp2 -11.091 13.447 -0.825 0.410
Ospedaleosp3 28.250 13.505 2.092 0.037
SessoM 77.572 11.187 6.934 0.000


The results of the full multiple linear regression model highlight several statistically significant determinants of newborn birthweight, while others do not appear to contribute meaningfully once all covariates are considered simultaneously. Among maternal characteristics, mother’s age does not show a significant association with birthweight, whereas the number of previous pregnancies is positively and significantly associated with birthweight, suggesting that multiparity is linked to slightly heavier newborns. As expected, gestational duration emerges as a strong and highly significant predictor: each additional week of gestation is associated with an average increase in birthweight, holding all other variables constant. Similarly, the anthropometric measures newborn length and head circumference show very strong positive effects, confirming their close relationship with birthweight and their central role in explaining variability in neonatal size. Regarding behavioral and contextual factors, maternal smoking exhibits a negative coefficient, indicating lower birthweight among newborns of smoking mothers; however, this effect is not statistically significant in the full model. The type of delivery shows a modest but statistically significant association, with vaginal deliveries associated with slightly higher birthweight compared to the reference category. This result likely reflects underlying pregnancy characteristics rather than a causal effect of delivery mode. Hospital-related effects are limited: one hospital does not differ significantly from the reference hospital, while another shows a small but statistically significant difference in average birthweight. Finally, newborn sex remains a highly significant predictor even after adjustment for all other variables, with male newborns weighing on average more than female newborns. Overall, the model confirms that gestational age, anthropometric measures, parity, and sex are the primary drivers of birthweight in this dataset, while other factors play a secondary or negligible role once these key predictors are taken into account.

model_fit_table(
  mod1,
  caption = "Model fit statistics for the full regression model")
(#tab:adjusted_r_squared)Model fit statistics for the full regression model
R_squared Adj_R_squared Residual_SE F_statistic AIC BIC
value 0.729 0.728 274.021 668.676 35145.57 35215.45

The overall goodness-of-fit statistics indicate that the full multiple linear regression model provides a strong explanatory power for newborn birthweight. The model achieves an R2 of 0.729, meaning that approximately 73% of the variability in birthweight is explained by the set of predictors included in the model. The adjusted R2 (0.728) is very close to the unadjusted R2, suggesting that the inclusion of multiple predictors does not result in substantial overfitting and that most variables contribute meaningfully to the model once complexity is taken into account. This stability supports the adequacy of the model specification. The residual standard error (≈ 274 g) provides an estimate of the typical deviation between observed and predicted birthweights, indicating that, on average, model predictions deviate from the observed values by less than 300 grams. Finally, the highly significant F-statistic confirms that the model, as a whole, explains a significant portion of the variability in birthweight and performs substantially better than a null model with no predictors.


Starting from the full model, variable selection procedures based on information criteria are applied to identify a more parsimonious specification. Stepwise selection using the Akaike Information Criterion (AIC) balances goodness of fit and model complexity, while the Bayesian Information Criterion (BIC) provides a more conservative alternative that favors simpler models. This step aims to retain only those predictors that meaningfully contribute to explaining birthweight, improving interpretability without sacrificing explanatory power.

# Backward stepwise removing Mother's age

mod2 <- update(mod1,~.-Anni.madre)

# Coefficient table

kable(
  round(summary(mod2)$coefficients, 3),
  caption = "Regression model after removing Mother's age")
(#tab:backward_remove_anni_madre)Regression model after removing Mother’s age
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6708.619 136.021 -49.320 0.000
N.gravidanze 12.583 4.340 2.899 0.004
Fumatrici -30.427 27.545 -1.105 0.269
Gestazione 32.300 3.800 8.501 0.000
Lunghezza 10.292 0.301 34.209 0.000
Cranio 10.487 0.426 24.638 0.000
Tipo.partoNat 29.665 12.089 2.454 0.014
Ospedaleosp2 -10.951 13.444 -0.815 0.415
Ospedaleosp3 28.517 13.499 2.113 0.035
SessoM 77.645 11.185 6.942 0.000
# Model fit statistics

model_fit_table(
  mod2,
  caption = "Model fit statistics for the backward stepwise-selected: removing mother's age")
(#tab:backward_remove_anni_madre)Model fit statistics for the backward stepwise-selected: removing mother’s age
R_squared Adj_R_squared Residual_SE F_statistic AIC BIC
value 0.729 0.728 273.993 743.072 35144.06 35208.12


The regression model obtained after removing mother’s age (Anni.madre) shows results that are largely consistent with those of the full specification. The exclusion of maternal age does not materially alter the estimated coefficients of the remaining predictors, nor does it substantially change their statistical significance. This suggests that mother’s age did not provide meaningful additional explanatory power once the other covariates were included. The key determinants of birthweight remain unchanged. Gestational duration, newborn length, and head circumference continue to exhibit strong and highly significant positive associations with birthweight, confirming their central role in explaining neonatal size. Newborn sex also remains highly significant, with male newborns weighing on average more than females, even after adjustment for anthropometric measures. The number of pregnancies retains a positive and statistically significant effect, while maternal smoking remains non-significant in this multivariable context. The overall goodness-of-fit statistics remain virtually unchanged compared to the full model (R2 ≈ 0.729; Adjusted R2 ≈ 0.728). The residual standard error is also nearly identical, and the information criteria (AIC and BIC) do not indicate any substantial deterioration in model performance. This confirms that removing mother’s age does not compromise explanatory power and supports the decision to exclude it in pursuit of a more parsimonious specification.

# Backward stepwise removing Birth hospital

mod3 <- update(mod2,~.-Ospedale)

# Coefficient table

kable(
  round(summary(mod3)$coefficients, 3),
  caption = "Regression model after removing Birth Ospital")
(#tab:backward_remove_ospedale)Regression model after removing Birth Ospital
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6708.809 136.064 -49.306 0.000
N.gravidanze 12.993 4.344 2.991 0.003
Fumatrici -31.882 27.580 -1.156 0.248
Gestazione 32.597 3.804 8.569 0.000
Lunghezza 10.268 0.301 34.098 0.000
Cranio 10.501 0.426 24.637 0.000
Tipo.partoNat 30.424 12.104 2.514 0.012
SessoM 78.103 11.200 6.974 0.000
# Model fit statistics

model_fit_table(
  mod3,
  caption = "Model fit statistics for the backward stepwise-selected: removing Birth Ospital")
(#tab:backward_remove_ospedale)Model fit statistics for the backward stepwise-selected: removing Birth Ospital
R_squared Adj_R_squared Residual_SE F_statistic AIC BIC
value 0.728 0.727 274.391 951.291 35149.32 35201.73


The regression model obtained after excluding the birth hospital variable shows results that remain highly consistent with the previous specification. The estimated coefficients for the core predictors—gestational duration, anthropometric measures, parity, type of delivery, and newborn sex—are virtually unchanged in magnitude and statistical significance. This stability indicates that hospital-related differences did not substantially confound the relationship between these variables and birthweight. From a goodness-of-fit perspective, the exclusion of the hospital variable produces only a negligible reduction in explanatory power (R2 ≈ 0.728; Adjusted R2 ≈ 0.727). The residual standard error remains essentially unchanged, indicating that predictive accuracy is preserved. Importantly, the information criteria (AIC and BIC) do not show meaningful deterioration compared to the previous model, supporting the decision to remove the hospital variable in favor of a more parsimonious specification. Overall, these results suggest that differences across hospitals do not materially affect newborn birthweight once individual clinical and biological factors are accounted for. The simplified model therefore achieves comparable explanatory performance while improving interpretability and parsimony.

# Backward stepwise removing Maternal smoking

mod4 <- update(mod3,~.-Fumatrici)

# Coefficient table

kable(
  round(summary(mod4)$coefficients, 3),
  caption = "Regression model after removing maternal smoking")
(#tab:backward_remove_fumatrici)Regression model after removing maternal smoking
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6708.017 136.072 -49.298 0.000
N.gravidanze 12.736 4.339 2.935 0.003
Gestazione 32.325 3.797 8.514 0.000
Lunghezza 10.283 0.301 34.177 0.000
Cranio 10.506 0.426 24.648 0.000
Tipo.partoNat 30.160 12.103 2.492 0.013
SessoM 77.917 11.199 6.957 0.000
# Model fit statistics

model_fit_table(
  mod4,
  caption = "Model fit statistics for the backward stepwise-selected: removing maternal smoking")
(#tab:backward_remove_fumatrici)Model fit statistics for the backward stepwise-selected: removing maternal smoking
R_squared Adj_R_squared Residual_SE F_statistic AIC BIC
value 0.728 0.727 274.41 1109.467 35148.67 35195.25


The regression model obtained after excluding maternal smoking remains highly consistent with the previous specifications, confirming that smoking status did not provide a statistically significant contribution to the explanation of birthweight in the multivariable context. The estimated coefficients of the remaining predictors are virtually unchanged in magnitude and significance, indicating that maternal smoking was not acting as a relevant confounder in this dataset. From a model performance perspective, the exclusion of maternal smoking produces no meaningful loss in explanatory power. The R2 and adjusted R2 remain essentially unchanged (≈ 0.728 and 0.727, respectively), and the residual standard error is nearly identical to that observed in previous models. Moreover, the information criteria (AIC and BIC) show a slight improvement, supporting the decision to remove maternal smoking in favor of a more parsimonious model. Overall, these findings suggest that maternal smoking does not independently influence birthweight once gestational duration and anthropometric measures are accounted for. The reduced model therefore maintains strong explanatory performance while improving simplicity and interpretability.

# Backward stepwise removing Type of Delivery

mod5 <- update(mod4,~.-Tipo.parto)

# Coefficient table

kable(
  round(summary(mod5)$coefficients, 3),
  caption = "Regression model after removing type of delivery")
(#tab:backward_remove_tipo_parto)Regression model after removing type of delivery
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6681.725 135.804 -49.201 0.000
N.gravidanze 12.455 4.342 2.869 0.004
Gestazione 32.383 3.801 8.520 0.000
Lunghezza 10.245 0.301 34.059 0.000
Cranio 10.541 0.426 24.717 0.000
SessoM 77.981 11.211 6.956 0.000
# Model fit statistics

model_fit_table(
  mod5,
  caption = "Model fit statistics for the backward stepwise-selected: removing type of delivery")
(#tab:backward_remove_tipo_parto)Model fit statistics for the backward stepwise-selected: removing type of delivery
R_squared Adj_R_squared Residual_SE F_statistic AIC BIC
value 0.727 0.726 274.697 1327.343 35152.89 35193.65


The regression model obtained after removing the type of delivery remains stable and retains strong explanatory power. The estimated effects of the remaining predictors—gestational duration, anthropometric measures, parity, and newborn sex—are virtually unchanged and remain highly statistically significant, indicating that the exclusion of delivery type does not materially affect the model structure. From a goodness-of-fit perspective, the reduction in model performance is minimal. The adjusted R2 decreases only marginally (from approximately 0.727 to 0.726), and the residual standard error remains close to previous values, suggesting that predictive accuracy is preserved. Although information criteria (AIC and BIC) show a slight increase, this change is modest and reflects the trade-off between model simplicity and fit. Overall, these results support the exclusion of delivery type in favor of a more parsimonious model, as this variable does not provide substantial additional explanatory power once key biological and gestational factors are accounted for.

# Backward stepwise removing Number of pregnancies

mod6 <- update(mod5,~.-N.gravidanze)

# Coefficient table

kable(
  round(summary(mod6)$coefficients, 3),
  caption = "Regression model after removing number of pregrancies")
(#tab:backward_remove_n_gravidanze)Regression model after removing number of pregrancies
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6651.673 135.595 -49.055 0
Gestazione 31.326 3.788 8.269 0
Lunghezza 10.202 0.301 33.909 0
Cranio 10.671 0.425 25.126 0
SessoM 79.103 11.220 7.050 0
# Model fit statistics

model_fit_table(
  mod6,
  caption = "Model fit statistics for the backward stepwise-selected: number of pregrancies")
(#tab:backward_remove_n_gravidanze)Model fit statistics for the backward stepwise-selected: number of pregrancies
R_squared Adj_R_squared Residual_SE F_statistic AIC BIC
value 0.726 0.726 275.095 1652.329 35159.12 35194.06

The regression coefficients quantify how each explanatory variable affects newborn birth weight while holding the other variables constant.

Gestational duration shows a positive effect on birth weight, meaning that each additional week of pregnancy is associated with an increase in the expected newborn weight.

Anthropometric measurements such as newborn length and cranial diameter also appear to be strong predictors of birth weight, which is consistent with biological expectations.

Maternal behavioural and demographic variables may also contribute to explaining variability in birth weight, although their impact is generally smaller compared with gestational duration and anthropometric measurements.


After removing the number of previous pregnancies (N.gravidanze), the regression model remains strongly driven by gestational and anthropometric factors. Gestational duration, newborn length, head circumference, and newborn sex continue to show highly significant associations with birthweight, and their estimated effects remain stable in magnitude. From a goodness-of-fit perspective, the explanatory power of the model decreases only marginally (Adjusted R2 ≈ 0.726), and the residual standard error increases slightly. However, the information criteria (AIC and BIC) show a modest deterioration compared to the previous specification, suggesting that removing parity slightly reduces model performance. Overall, while the model without the number of pregnancies still explains a substantial proportion of the variability in birthweight, the slight worsening of fit indicators suggests that parity contributes additional explanatory information and may be retained in the final specification.

4.3.1 Model Selection


To identify the most appropriate regression specification, a model selection phase was conducted starting from the full multiple linear regression model. The goal is to obtain a parsimonious model that retains only predictors providing meaningful explanatory power, while avoiding unnecessary complexity. In this context, backward stepwise selection is applied, progressively removing predictors and evaluating the impact on overall model quality. Model comparison is performed using information criteria, namely the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both criteria balance model fit and complexity, but BIC penalizes complexity more strongly and therefore tends to select simpler models. The selected model is subsequently evaluated through adjusted R2, residual standard error (RSE), ANOVA. To formally assess whether progressively simplified regression specifications lead to a statistically meaningful loss of explanatory power, a nested-model ANOVA (partial F-test) will be conducted across the sequence of candidate models. This procedure compares each reduced model to the preceding (more complex) specification under the assumption of nesting, and tests whether the exclusion of predictors results in a significant increase in the residual sum of squares. In this context, ANOVA provides an inferential complement to information-criterion approaches (AIC/BIC), allowing model simplification decisions to be supported by formal hypothesis testing.

anova_table <- anova(mod1,mod2,mod3,mod4,mod5,mod6)

rownames(anova_table) <- c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6")

knitr::kable(anova_table, caption = "Confronto ANOVA")
(#tab:anova_table)Confronto ANOVA
Res.Df RSS Df Sum of Sq F Pr(>F)
Model 1 2487 186743194 NA NA NA NA
Model 2 2488 186779904 -1 -36710.0 0.4888948 0.4844861
Model 3 2490 187473818 -2 -693913.8 4.6206866 0.0099307
Model 4 2491 187574428 -1 -100610.4 1.3399046 0.2471620
Model 5 2492 188042054 -1 -467625.7 6.2277237 0.0126409
Model 6 2493 188663107 -1 -621053.3 8.2710355 0.0040625


Model 6 was selected as the best-performing specification after comparing all candidate models using information criteria (AIC/BIC), predictive accuracy metrics (e.g., RMSE), and diagnostic checks on residual behavior. Overall, Model 6 provides the most favorable trade-off between goodness-of-fit and parsimony: it achieves strong predictive performance without introducing unnecessary complexity that could compromise generalizability or interpretability.

4.3.2 Model Interaction

The interaction effect between gestational duration and maternal smoking is tested starting from the selected final model (Model 6), in order to preserve the logic of the model selection process. This specification allows us to assess whether the effect of gestational duration on birthweight differs between smoking and non-smoking mothers, without reintroducing predictors that had already been excluded in previous steps.

# Interaction added starting from the selected final model (mod6)
mod6_int <- update(mod6, . ~ . + Gestazione:Fumatrici)

# Comparison between mod6 and the interaction model
interaction_comp <- data.frame(
  Model = c("Model 6", "Model 6 + Gestazione:Fumatrici"),
  AIC = c(AIC(mod6), AIC(mod6_int)),
  BIC = c(BIC(mod6), BIC(mod6_int)),
  Adj_R2 = c(summary(mod6)$adj.r.squared,
             summary(mod6_int)$adj.r.squared)
)

knitr::kable(
  interaction_comp,
  digits =3, 
  caption = "Comparison between Model 6 and Model 6 with interaction"
)
Table 4.1: Comparison between Model 6 and Model 6 with interaction
Model AIC BIC Adj_R2
Model 6 35159.12 35194.06 0.726
Model 6 + Gestazione:Fumatrici 35160.12 35200.89 0.726
anova(mod6, mod6_int)
## Analysis of Variance Table
## 
## Model 1: Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Model 2: Peso ~ Gestazione + Lunghezza + Cranio + Sesso + Gestazione:Fumatrici
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   2493 188663107                           
## 2   2492 188587860  1     75247 0.9943 0.3188
summary(mod6_int)
## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     Gestazione:Fumatrici, data = newborn_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1138.71  -182.57   -17.38   163.25  2623.42 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6652.2044   135.5964 -49.059  < 2e-16 ***
## Gestazione              31.5608     3.7957   8.315  < 2e-16 ***
## Lunghezza               10.1883     0.3012  33.825  < 2e-16 ***
## Cranio                  10.6689     0.4247  25.122  < 2e-16 ***
## SessoM                  79.3031    11.2223   7.067 2.05e-12 ***
## Gestazione:Fumatrici    -0.7008     0.7028  -0.997    0.319    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2492 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7257 
## F-statistic:  1322 on 5 and 2492 DF,  p-value: < 2.2e-16

The interaction between gestational duration and maternal smoking was tested starting from Model 6. The comparison between the two models shows that adding the interaction term does not improve the model fit. In particular, AIC and BIC slightly increase and the adjusted R² remains unchanged (0.726). The ANOVA test confirms that the interaction term is not statistically significant (p = 0.3188). Consistently, the coefficient associated with the interaction in the regression output is not significant. Therefore, there is no evidence that the effect of gestational duration on birth weight differs between smoking and non-smoking mothers. For this reason, the interaction term is not retained and Model 6 is kept as the final specification.

4.3.3 Multicollinearity Assessment


Before interpreting coefficients from Model 6, it is necessary to asses whether high multicollinearity is present among the predictors. Multicollinearity arises when two or more regressors are strongly correlated, which may inflate standard errors, destabilize coefficient estimates, and reduce interpretability (even if predictive accuracy remains acceptable). This diagnostic is especially important when the model includes anthropometric variables (e.g., length and cranial diameter), which are biologically related and can be correlated with each other and with gestational age. In order to quantify multicollinearity the Variance Inflation Factor (VIF) will be used. As a rule of thumb, VIF values above 5 suggest a potential concern, while values above 10 indicate severe multicollinearity that may require corrective actions such as removing or combining predictors, using regularization (e.g., ridge regression), or adopting dimension reduction strategies.

knitr::kable(
  vif(mod6),
  caption = "Multicollinearity among regressors"
)
Table 4.2: Multicollinearity among regressors
x
Gestazione 1.654101
Lunghezza 2.070582
Cranio 1.606316
Sesso 1.038918


All predictors exhibit VIF values close to 1, with the highest value observed for Length (VIF = 2.07), followed by Gestational Age (VIF = 1.65) and Cranial Diameter (VIF = 1.61). The variable Sex shows a VIF very close to 1 (VIF = 1.04). These values are well below the commonly accepted thresholds of 5 (moderate concern) and 10 (serious multicollinearity), indicating that no problematic collinearity exists among the regressors. Although anthropometric variables are biologically related and expected to be correlated, the observed levels of association do not inflate the variance of the coefficient estimates to a concerning degree. Therefore, the regression coefficients in Model 6 can be considered stable and reliable, and no corrective measures (such as variable removal, transformation, or regularization techniques) are required at this stage.

# R-squared
r2  <- summary(mod6)$r.squared
adj_r2 <- summary(mod6)$adj.r.squared

# RMSE
rmse <- sqrt(mean(residuals(mod6)^2))

# Create summary table
fit_table <- data.frame(
  Metric = c("R-squared",
             "Adjusted R-squared",
             "RMSE (grams)"),
  Value = round(c(r2, adj_r2, rmse), 4)
)

kable(fit_table,
      caption = "Model 6 – Goodness of Fit Metrics",
      align = "c")
(#tab:goodness_of_fit)Model 6 – Goodness of Fit Metrics
Metric Value
R-squared 0.7261
Adjusted R-squared 0.7257
RMSE (grams) 274.8193

4.4 Model Quality Evaluation

Once the final regression model has been selected, it is important to evaluate its predictive quality and verify whether the assumptions of the linear regression model are reasonably satisfied.

In particular we analyse: - the overall goodness of fit of the model through the R² value - the residual distribution - the presence of influential observations that could distort the results

# Summary of the final regression model
summary(mod6)
## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
##     data = newborn_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.1  -184.4   -17.4   163.3  2626.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.6732   135.5952 -49.055  < 2e-16 ***
## Gestazione     31.3262     3.7884   8.269  < 2e-16 ***
## Lunghezza      10.2024     0.3009  33.909  < 2e-16 ***
## Cranio         10.6706     0.4247  25.126  < 2e-16 ***
## SessoM         79.1027    11.2205   7.050 2.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1652 on 4 and 2493 DF,  p-value: < 2.2e-16
# Diagnostic plots
par(mfrow=c(2,2))
plot(mod6)

The diagnostic plots help evaluate the assumptions of linear regression.
The residuals vs fitted plot is used to assess linearity and homoscedasticity, while the normal Q-Q plot provides a visual indication of whether residuals follow an approximately normal distribution.

Small deviations are common in real datasets and do not necessarily invalidate the predictive usefulness of the model.

5 Predictions

After validating the regression model, we can use it to generate predictions for new observations.

For example, we estimate the expected birth weight of a newborn with the following characteristics:

  • mother age: 32 years
  • third pregnancy
  • non-smoking mother
  • gestational duration: 39 weeks
  • newborn sex: female
new_case <- data.frame(
  Anni.madre = 32,
  N.gravidanze = 3,
  Fumatrici = 0,
  Gestazione = 39,
  Lunghezza = 500,
  Cranio = 340,
  Tipo.parto = "Nat",
  Ospedale = "osp1",
  Sesso = "F"
)

predicted_weight <- predict(mod6, newdata = new_case)
mean_weight <- mean(newborn_clean$Peso)

prediction_table <- data.frame(
  "Predicted weight (g)" = round(as.numeric(predicted_weight), 2),
  "Average w (g)" = round(mean_weight, 2)
)

knitr::kable(
  prediction_table,
  col.names = c("Predicted weight (g)", "Average weight(g)"),
  caption = "Prediction result",
  align = "c"
)
(#tab:prediction_example)Prediction result
Predicted weight (g) Average weight(g)
3299.27 3284.18

The table compares the predicted birth weight for the hypothetical newborn with the average birth weight observed in the dataset. A prediction close to the sample mean suggests that the selected characteristics describe a fairly typical birth scenario according to the estimated model.

6 Visualizations

Visual representations are useful for understanding the relationships between the main predictors and newborn weight.

6.0.1 Gestational duration, sex and birth weight

ggplot(data = newborn_clean) +
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = Sesso),
             position = "jitter") +
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = Sesso),
              se = FALSE,
              method = "lm") +
  labs(title = "Impact of gestational weeks and sex on newborn weight",
       x = "Gestational Weeks",
       y = "Birth Weight (grams)",
       colour = "Sex") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

The plot highlights a clear positive relationship between gestational duration and newborn birth weight for both sexes. The separation between the two fitted lines also suggests that sex may contribute to differences in birth weight, with one group tending to have slightly higher predicted values across gestational weeks.

6.0.2 Maternal smoking and birth weight

ggplot(data = newborn_clean) +
  geom_boxplot(aes(x = factor(Fumatrici),
                   y = Peso)) +
  labs(title = "Birth weight by maternal smoking status",
       x = "Maternal Smoking",
       y = "Birth Weight (grams)") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_classic()

The boxplot compares birth weight between smoking and non-smoking mothers. Differences between the two groups may indicate that maternal smoking has a negative impact on fetal growth, leading to lower birth weights on average.

7 Conclusions

The objective of this project was to analyse clinical data from three hospitals in order to identify the main factors influencing newborn birth weight and to build a predictive statistical model.
The exploratory analysis allowed us to understand the structure of the dataset and detect potential anomalies in the data. Several hypothesis tests were performed to evaluate differences between hospitals, verify whether sample averages differ from population values, and assess anthropometric differences between male and female newborns.
A multiple linear regression model was then developed to quantify the relationship between birth weight and several explanatory variables, including maternal characteristics, gestational duration and neonatal measurements. Model selection techniques were applied in order to identify a parsimonious model with good predictive performance.
The results highlight the importance of gestational duration and newborn anthropometric measurements as key predictors of birth weight. These findings are consistent with existing medical knowledge and confirm the usefulness of statistical modelling in supporting neonatal healthcare analysis.