Project introduction

The project develops a statistical model to predict newborn birth weight using data from three hospitals. The goal is to improve high-risk pregnancy care, optimize resources, and reduce neonatal complications through timely intervention and risk factor identification.

Load dataset

The dataset contains continuous numeric variables (birth weight, length, head circumference), discrete numeric variables (maternal age, pregnancies, gestational age), and nominal qualitative variables (maternal smoking, delivery type, hospital, infant sex). Birth weight is the target variable.

#Load dataset
data <- read.csv("C:/Users/matmi/OneDrive/Desktop/ProgettiR/neonati.csv")

attach(data)

#Factorize qualitative variables
data$Fumatrici <- factor(data$Fumatrici, levels = c(0,1), labels = c("NonFumatrici", "Fumatrici"))
data$Tipo.parto <- factor(data$Tipo.parto, levels = c("Nat", "Ces"))
data$Ospedale <- factor(data$Ospedale, levels = c("osp1", "osp2", "osp3"))
data$Sesso <- factor(data$Sesso, levels = c("M", "F"))

Dataset descriptive analysis

In this phase the variables of the dataset are analyzed based on their typology.

Continue quantitative variables

The distributions are consistent with clinical expectations and show similar asymmetries between anthropometric variables. This suggests correlation of biological data.

#Continue quantitative variables of dataset
cont_vars <- c("Peso", "Lunghezza", "Cranio")

#Function for statistical index calculation
calc_stats <- function(x) {
  c(
    mean = round(mean(x, na.rm = TRUE), 2),
    median = round(median(x, na.rm = TRUE), 2),
    sd = round(sd(x, na.rm = TRUE), 2),
    skew = round(skewness(x, na.rm = TRUE), 2),
    min = round(min(x, na.rm = TRUE), 2),
    max = round(max(x, na.rm = TRUE), 2)
  )
}
print_stat_summary <- function(data, vars, caption) {
  summary_mat <- sapply(data[, vars], calc_stats)
  kable(summary_mat, caption = caption)
}

print_stat_summary(data, cont_vars, "Matrix of statistical index for continuous quantitative variables")
Matrix of statistical index for continuous quantitative variables
Peso Lunghezza Cranio
mean 3284.08 494.69 340.03
median 3300.00 500.00 340.00
sd 525.04 26.32 16.43
skew -0.65 -1.51 -0.79
min 830.00 310.00 235.00
max 4930.00 565.00 390.00

Normality test and analysis for target variable (Peso)

The target variable (Weight) is tested for normality. The Shapiro-Wilk test rejects normality, as expected in large samples. The Q-Q plot shows approximate symmetry. Since linear regression requires normality of the residuals, which will be assessed during modeling, it is considered appropriate to approach this modeling.

#Investigate normality for target variable (Peso)

#Shapiro-Wilk test 
shapiro_test <- shapiro.test(Peso)

shapiro_df <- data.frame(
  Test = "Shapiro-Wilk",
  Statistic = round(shapiro_test$statistic, 2),
  P_Value = round(shapiro_test$p.value, 3),
  Conclusion = ifelse(shapiro_test$p.value > 0.05,
                      "Fail to reject H0 (normality)",
                      "Reject H0 (not normal)")
)

#Shapiro-Wilk test results for target variable
kable(shapiro_df, caption = "Results of Shapiro-Wilk test on target variable", row.names = TRUE)
Results of Shapiro-Wilk test on target variable
Test Statistic P_Value Conclusion
W Shapiro-Wilk 0.97 0 Reject H0 (not normal)
#Q-Q plot
ggplot(data, aes(sample = Peso)) +
  stat_qq(color = "blue") + stat_qq_line(color = "red") +
  labs(title = "Q-Q plot of birth weight (Peso)",
          x = "Theoretical quantiles",
          y = "Sample quantiles") +
  theme_classic()

Discrete quantitative variables

Outliers with maternal age close to zero are observed; these are considered errors and removed during the cleaning phase. The distributions reflect the expected clinical characteristics: maternal age around 30 years, pregnancies skewed toward low values, and gestational age with a left tail for premature births.

#Discrete quantitative variables of dataset
disc_vars <- c("Anni.madre", "N.gravidanze", "Gestazione")

#Histogram plots
plot_multi_hist <- function(data, vars) {
  plots <- lapply(vars, function(v) {
    ggplot(data, aes_string(x = v)) + 
      geom_bar(fill = "blue", color = "black") +
      labs(title = paste("Freq. of", v)) +
      theme_classic()
  })
  gridExtra::grid.arrange(grobs = plots, nrow = 1)
}

plot_multi_hist(data, disc_vars)

Qualitative nominal variables

The prevalence of smoking is low, and the gender distribution is balanced. Maternal smoking is considered a risk factor, while the hospital and delivery method are variables that are not expected to be clinically relevant and are not analyzed.

#Function to calculate contingency table
desc_qual_df <- function(v) {
  tab <- table(data[[v]])
  perc <- round(100 * prop.table(tab), 1)
  df <- data.frame(
    Variable = v,
    Level = names(tab),
    Count = as.numeric(tab),
    Percentage = perc,
    stringsAsFactors = FALSE
  )
  
  return(
    kable(
      cbind(df[, 1:3], df[,5]),
      caption = paste("Frequency distribution of variable", v),
      col.names = c("Variable", "Level", "Count", "Percentage")
    ) %>%
    kable_styling(full_width = FALSE, position = "left")
  )
}

#Contingency table results for variable Fumatrici
desc_qual_df("Fumatrici")
Frequency distribution of variable Fumatrici
Variable Level Count Percentage
Fumatrici NonFumatrici 2396 95.8
Fumatrici Fumatrici 104 4.2
#Contingency table results for variable Sesso
desc_qual_df("Sesso")
Frequency distribution of variable Sesso
Variable Level Count Percentage
Sesso M 1244 49.8
Sesso F 1256 50.2

Outlier analysis

In this section, we analyze, through tabular displays, the outliers of the variables with particular clinical significance and with a probable greater influence on the analysis.

#Function to calculate outliers
calc_outliers <- function(x, varname = "variable") {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  which_outliers_low <- which(x < lower_bound)
  which_outliers_up  <- which(x > upper_bound)
  
 if(length(which_outliers_low) + length(which_outliers_up) == 0) {
    df <- data.frame(
          Variable = varname,
          Number_low_outliers = 0,
          Number_high_outliers = 0,
          Percent_low_outliers = 0,
          Percent_high_outliers = 0,
          Conclusion = "No outliers detected",
          stringsAsFactors = FALSE
          )
  } else {
  n_outliers_low <- length(which_outliers_low)  
  n_outliers_up <- length(which_outliers_up)
  total <- sum(!is.na(x))
  percent_outliers_low <- 100 * n_outliers_low / total 
  percent_outliers_up <- 100* n_outliers_up / total
  
  conclusion <- ifelse(percent_outliers_low + percent_outliers_up > 5, "High number of outliers", "Acceptable number of outliers")
  
  df <- data.frame(
      Variable = varname,
          Number_low_outliers = n_outliers_low,
          Number_high_outliers = n_outliers_up,
          Percent_low_outliers = percent_outliers_low,
          Percent_high_outliers = percent_outliers_up,
          Conclusion = conclusion,
          stringsAsFactors = FALSE
  )
}

  return(
    kable(df, caption = paste("Table of number, percentage and conclusion for outliers of variable", varname)) %>%
      kable_styling(full_width = FALSE, position = "left")
  )
}

Outliers are evaluated primarily for the target variable (Peso).

#Calculate outliers of Peso
calc_outliers(data[, "Peso"], varname = "Peso")
Table of number, percentage and conclusion for outliers of variable Peso
Variable Number_low_outliers Number_high_outliers Percent_low_outliers Percent_high_outliers Conclusion
Peso 55 14 2.2 0.56 Acceptable number of outliers

Outliers of anthropometric measurements are checked to verify the concordance of their distributions with the target variable.

#Calculate outliers of Lunghezza
calc_outliers(data[, "Lunghezza"], varname = "Lunghezza")
Table of number, percentage and conclusion for outliers of variable Lunghezza
Variable Number_low_outliers Number_high_outliers Percent_low_outliers Percent_high_outliers Conclusion
Lunghezza 56 3 2.24 0.12 Acceptable number of outliers
#Calculate outliers of Cranio
calc_outliers(data[, "Cranio"], varname = "Cranio")
Table of number, percentage and conclusion for outliers of variable Cranio
Variable Number_low_outliers Number_high_outliers Percent_low_outliers Percent_high_outliers Conclusion
Cranio 37 11 1.48 0.44 Acceptable number of outliers

Outliers also occur for gestational weeks, potentially clinically influential on weight.

#Calculate outliers of Gestazione
calc_outliers(data[, "Gestazione"], varname = "Gestazione")
Table of number, percentage and conclusion for outliers of variable Gestazione
Variable Number_low_outliers Number_high_outliers Percent_low_outliers Percent_high_outliers Conclusion
Gestazione 67 0 2.68 0 Acceptable number of outliers

Similar anomalous trends are found in anthropometric measurements and gestational age, with mostly low values, likely corresponding to premature births. The absence of high anomalous values for gestational age, combined with high measurements, suggests a saturation effect, which should be explored in the modeling.

Anomal values elimination

Maternal age values <10, biologically implausible, are removed without impacting dataset quality.

#Function to find and remove error in variable Anni.Madre
remove_rows_condition <- function(data, var, condition_func) {
  rows_to_remove <- which(condition_func(data[[var]]))
  data_filtered <- data[-rows_to_remove, ]
  return(data_filtered)
}

data <- remove_rows_condition(data, "Anni.madre", function(x) x < 10)

Test for specific clinical hypotesis

In the second phase of the analysis, specific hypotheses of clinical and statistical relevance are tested to better understand the relationships between qualitative and quantitative variables in the dataset.

Test for differece of Tipo.parto among Ospedale

The test shows that there is no evidence supporting significant differences in the frequency of cesarean deliveries between hospitals. Therefore, in the predictive modeling of neonatal weight, there is no statistical justification to include the Hospital variable or its indirect effects through Type of delivery.

#Function to calculate contingency table for Tipo.parto and Ospedale
calc_tab_cont <- function(var1, var2) {
  cont_tab <- table(var1, var2)
  return(cont_tab)
}

cont_tab <- calc_tab_cont(data$Tipo.parto, data$Ospedale)

kable(cont_tab, caption = "Contingency table for variables Tipo.parto and Ospedale")
Contingency table for variables Tipo.parto and Ospedale
osp1 osp2 osp3
Nat 574 594 602
Ces 242 254 232
#Chi-squared test execution
result_t1 <- chisq.test(cont_tab)
test_ces_df <- data.frame(
  Method = result_t1$method,
  Statistic = round(as.numeric(result_t1$statistic), 2),
  P_Value = round(result_t1$p.value, 3)
)

kable(test_ces_df, caption = "Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale")
Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale
Method Statistic P_Value
Pearson’s Chi-squared test 1.08 0.582

Test for difference of Peso and Lunghezza from population values

Peso and Lunghezza are compared to the 2024 Cedap dataset reference values. Given the continuous nature of the variables, the large sample size, and only minor deviations from normality, is used the one-sample t-test.

#Function to test Peso and Lunghezza then same population variable
compare_to_population <- function(x, mu, alpha = 0.05, varname = "variable") {
  x <- na.omit(x)
  
  #T test
  t_res <- t.test(x, mu = mu)
  
  #Summary of results
  result <- data.frame(
      parametric_test = "One-sample t-test",
      p_value_ptest = round(t_res$p.value, 3),
      conclusion_ptest = ifelse(t_res$p.value < alpha,
                          "Reject H0. Target variable is statistically different from population variable",
                          "Fail to Reject H0. Target variable is not statistically different from population variable"),
      stringsAsFactors = FALSE    
    )
  
    return(
    kable(result,
          caption = paste("Summary table of results for comparison between", varname, "of sample and population value"),
          col.names = c("Parametric test", "P-value", "Conclusion")) %>%
      kable_styling(full_width = FALSE, position = "left")
  )
 }

#Test for weight and length
weight_ref <- 3300
length_ref <- 500

compare_to_population(data$Peso, mu = weight_ref, varname = "Peso")
Summary table of results for comparison between Peso of sample and population value
Parametric test P-value Conclusion
One-sample t-test 0.132 Fail to Reject H0. Target variable is not statistically different from population variable
compare_to_population(data$Lunghezza, mu = length_ref, varname = "Lunghezza")
Summary table of results for comparison between Lunghezza of sample and population value
Parametric test P-value Conclusion
One-sample t-test 0 Reject H0. Target variable is statistically different from population variable

Weight shows no significant differences from the reference mean (3300 g). Length, however, shows a significant difference from the reference value (500 mm). The discrepancy in length could reflect specific effects emerging from the data set, investigated during the modeling phase.

Test for difference of anthropometric variables among Sesso

In this section are tested differences between anthropometric variables among Sesso. Normality of the variables within groups is considered satisfied cause the large size of sample. Homogeneity of variances is tested with Leven test. Since the variables (Peso, Lunghezza, Cranio) are continuous quantitative measures with approximately normal distributions, parametric tests can be used to assess differences between Sex groups.

cat_for_test <- c("Sesso")
num_for_test <- c("Peso", "Lunghezza", "Cranio")

test_group_diff <- function(data, num_var, cat_var) {
  x <- data[[num_var]]
  g <- data[[cat_var]]
  
  df <- data.frame(x = x, g = g) %>% filter(!is.na(x), !is.na(g))
  
  n_groups <- length(unique(df$g))
  
  #Omogeneity variance test (Levene)
  levene_p <- car::leveneTest(x ~ g, data = df)$"Pr(>F)"[1]
  variance_homogeneous <- (levene_p > 0.05)
  
  if (n_groups == 2) {
    if (variance_homogeneous) {
      test <- t.test(x ~ g, data = df, var.equal = TRUE)
      test_name <- "t-test (equal variance)"
    } else {
      test <- t.test(x ~ g, data = df, var.equal = FALSE)
      test_name <- "Welch t-test (different variance)"
    }
  } else {
    return(NA)
  }
  
  param_pvalue <- test$p.value
  
  return(list(
    test = test_name,
    param_pvalue = param_pvalue
  ))
}

#Return of test results
results_tests <- expand.grid(num_var = num_for_test, cat_var = cat_for_test, stringsAsFactors = FALSE)

results_tests <- results_tests %>%
  rowwise() %>%
  mutate(test_result = list(test_group_diff(data, num_var, cat_var))) %>%
  mutate(
  test = test_result$test,
  param_pvalue = round(test_result$param_pvalue, 3),
  param_significance = case_when(
    param_pvalue < 0.001 ~ "***",
    param_pvalue < 0.01  ~ "**",
    param_pvalue < 0.05  ~ "*",
    TRUE                 ~ ""
  )) %>%
  ungroup() %>%
  dplyr::select(num_var, cat_var, test, param_pvalue, param_significance)

#Print of results for Peso, Lunghezza and Cranio among Sesso
kable(
  results_tests[1:3,],
  caption = "Statistical test results for differences beetwen groups of variable Sesso",
  col.names = c("Numerical variable", "Categorical variable", "Parametric test", "P-value pametric test", "Significance parametric test")
)
Statistical test results for differences beetwen groups of variable Sesso
Numerical variable Categorical variable Parametric test P-value pametric test Significance parametric test
Peso Sesso t-test (equal variance) 0 ***
Lunghezza Sesso Welch t-test (different variance) 0 ***
Cranio Sesso t-test (equal variance) 0 ***

The results show that for all three anthropometric variables analyzed there are statistically significant differences between the groups defined by sex. These results suggest that the variable Sesso is an important discriminative factor to include as a covariate.

Correlation analysis

This section performs a correlation analysis among the main variables of biological interest.

#Variable for correlation analysis
vars_corr <- c("Peso", "Lunghezza", "Cranio")

#Plot of correlation matrix for anthropometric variables
ggpairs(data, columns = vars_corr,  
        aes(color = Fumatrici),
        lower = list(continuous = wrap("smooth_loess", alpha = 0.3, size = 0.5)),  
        upper = list(continuous = wrap("cor", size = 5)),  
        diag = list(continuous = wrap("densityDiag"))) +  
  theme_classic() +  
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +  
  ggtitle("Correlation matrix colored by Fumatrici")

Anthropometric measurements are highly correlated, indicating the biological coherence of the sample. Maternal smoking appears to affect length more than the other measurements.

#Function for scatterplot
plot_scatter <- function(data, wrap_var) {
  facet_formula <- as.formula(paste("~", wrap_var))
  ggplot(data, aes(x = Gestazione, y = Peso)) +
    geom_point() +
    geom_smooth(method = "loess") +
    theme_classic() +
    facet_wrap(facet_formula) +
    ggtitle(paste("Relationship between Peso and Gestazione by", wrap_var)) +
    labs(x = "Gestazione", y = "Peso")
}

#Plot of relationship between Peso and Gestazione among Fumatrici
plot_scatter(data, "Fumatrici")

Clustering by maternal smoking status highlights a possible smoking-associated alteration in fetal growth, which should be taken into account in the multivariate model.

#Plot of relationship between Peso and Gestazione among Sesso
plot_scatter(data, "Sesso")

Gender influences growth rate, with males showing faster weight gain. The relationship between weight and gestational age is not perfectly linear, confirming the saturation effect.

Gestational age, sex, maternal smoking, maternal characteristics, and anthropometric variables will be included as predictors of birth weight for the initial model.

Regression model

First Regression model

The initial linear regression model includes biologically relevant variables for predicting newborn birth weight, excluding delivery type and hospital variables, which lack biological significance.

#First model with variable biologically significant
mod1 <- lm(Peso ~ Lunghezza + Cranio + Gestazione + Anni.madre + N.gravidanze + Fumatrici + Sesso, data = data)

#Function for regression model coefficients
coef_fun <- function(mod) {
  summary_mod <- summary(mod)
  coef_tab <- as.data.frame(summary_mod$coefficients)
  coef_tab$p_value <- coef_tab$`Pr(>|t|)`
  
  coef_tab$Significance <- with(coef_tab, ifelse(
    p_value < 0.001, "***",
    ifelse(p_value < 0.01, "**",
           ifelse(p_value < 0.05, "*", ""))
  ))
  
  coef_tab <- coef_tab[, c("Estimate", "Std. Error", "t value", "p_value", "Significance")]
  coef_tab[,1:3] <- round(coef_tab[,1:3], 2)
  coef_tab$p_value <- round(coef_tab$p_value, 3)
  
  coef_tab_kable <- kable(coef_tab, caption = "Coefficient summary of initial regression model") %>%
  kable_styling(full_width = FALSE, position = "left")
  
  return(coef_tab_kable)
}  

#Print of regression model coefficients of the first model
coef_fun(mod1)
Coefficient summary of initial regression model
Estimate Std. Error t value p_value Significance
(Intercept) -6634.16 143.24 -46.31 0.000 ***
Lunghezza 10.23 0.30 33.98 0.000 ***
Cranio 10.52 0.43 24.63 0.000 ***
Gestazione 32.95 3.83 8.61 0.000 ***
Anni.madre 0.88 1.15 0.77 0.444
N.gravidanze 11.38 4.68 2.43 0.015
FumatriciFumatrici -30.40 27.61 -1.10 0.271
SessoF -78.08 11.21 -6.96 0.000 ***
#Function to verify multicollinearity
vif_fun <- function(mod) {
  vif_values <- vif(mod)
  #VIF table
  if (is.matrix(vif_values)) {
    vif_tab <- data.frame(
      Variable = rownames(vif_values),
      GVIF = round(vif_values[,"GVIF"], 2),
      GVIF_corr = round(vif_values[,"GVIF^(1/(2*Df))"], 2)
    )
  } else {
    vif_tab <- data.frame(
      Variable = names(vif_values),
      VIF = round(vif_values, 2)
    )
  }

  vif_kable <- kable(vif_tab, caption = "Variance Inflation Factor (VIF)")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left")
  
  return(vif_kable)
}

In the initial specification, anthropometric measurements and gestational age are strong positive predictors of birth weight, while female newborns have a lower mean birth weight than males. Maternal age and smoking status do not appear to significantly influence the target variable.

#Print of VIF table
vif_fun(mod1)
Variance Inflation Factor (VIF)
Variable VIF
Lunghezza Lunghezza 2.08
Cranio Cranio 1.63
Gestazione Gestazione 1.69
Anni.madre Anni.madre 1.19
N.gravidanze N.gravidanze 1.19
Fumatrici Fumatrici 1.01
Sesso Sesso 1.04

Given the observed correlation between the anthropometric measurements, multicollinearity is assessed and found not to be problematic based on the VIF values.

Backward approach for chose best model

Variable skimming

Maternal age is removed as it is not statistically significant and the comparison between the models confirms that its exclusion does not worsen the model fit.

#Second model with variable Anni.madre elimination
mod2 <- update(mod1, ~. -Anni.madre)

#Print of regression model coefficients of the second model
coef_fun(mod2)
Coefficient summary of initial regression model
Estimate Std. Error t value p_value Significance
(Intercept) -6604.10 137.75 -47.94 0.000 ***
Lunghezza 10.23 0.30 33.98 0.000 ***
Cranio 10.54 0.43 24.71 0.000 ***
Gestazione 32.64 3.81 8.57 0.000 ***
N.gravidanze 12.70 4.35 2.92 0.004 **
FumatriciFumatrici -30.57 27.60 -1.11 0.268
SessoF -78.16 11.21 -6.97 0.000 ***
#Function to compare results of nested models
anova_fun <- function(mod_simple, mod_complex) {
  terms_simple <- attr(terms(mod_simple), "term.labels")
  terms_complex <- attr(terms(mod_complex), "term.labels")
  
  is_nested <- all(terms_simple %in% terms_complex)
  
  if (!is_nested) {
    cat("Models are NOT nested. ANOVA test is not possible.\n")
    return(NULL)
  }
  
  anova_res <- round(anova(mod_simple, mod_complex), 2)
  
  anova_df <- as.data.frame(anova_res) %>%
  mutate(Model = c("More simple model", "More complex model")) %>%
  dplyr::select(Model, everything())
  
  anova_kable <- kable(anova_df, caption = "Test ANOVA results", col.names = c("Model", "Resid.Df", "Residual Sum Sq", "Df (Diff)", "Sum of Sq (Diff)", "F value", "P-value")) %>%
  kableExtra::kable_styling(full_width = FALSE, position = "left")
  
  return(anova_kable)
}

#Print of ANOVA test results
anova_fun(mod2, mod1)
Test ANOVA results
Model Resid.Df Residual Sum Sq Df (Diff) Sum of Sq (Diff) F value P-value
More simple model 2491 187949505 NA NA NA NA
More complex model 2490 187905214 1 44291.73 0.59 0.44

Second-order effects

Based on the exploratory analysis, a potential interaction between maternal smoking and gestational age is tested. The interaction did not improve model fit according to AIC and BIC criteria and was therefore excluded.

#Model with interaction between Fumatrici and Gestazione
data$Gestazione_c <- scale(data$Gestazione, center = TRUE, scale = FALSE)

mod3 <- lm(Peso ~ Lunghezza + Cranio + N.gravidanze + Fumatrici*Gestazione_c + Sesso, data = data)

#Print of regression model coefficients of the third model
coef_fun(mod3)
Coefficient summary of initial regression model
Estimate Std. Error t value p_value Significance
(Intercept) -5326.80 154.98 -34.37 0.000 ***
Lunghezza 10.23 0.30 33.96 0.000 ***
Cranio 10.53 0.43 24.69 0.000 ***
N.gravidanze 12.75 4.35 2.93 0.003 **
FumatriciFumatrici -24.70 28.12 -0.88 0.380
Gestazione_c 33.20 3.84 8.64 0.000 ***
SessoF -78.74 11.22 -7.02 0.000 ***
FumatriciFumatrici:Gestazione_c -21.05 19.28 -1.09 0.275
#Function to test with method AIC and BIC
aic_bic_fun <- function(mod_prev, mod_new) {
  aic_tab <- round(AIC(mod_prev, mod_new), 2)
  bic_tab <- round(BIC(mod_prev, mod_new), 2)
  
  aic_kable <- knitr::kable(aic_tab, caption = "Method AIC for best model chose") %>%
          kableExtra::kable_styling(full_width = FALSE, position = "left")
  bic_kable <- knitr::kable(bic_tab, caption = "Method BIC for best model chose") %>%
          kableExtra::kable_styling(full_width = FALSE, position = "left")
  
  return(list(AIC = aic_kable, BIC = bic_kable))
}

#Print of AIC and BIC test
aicbic_tables <- aic_bic_fun(mod2, mod3)
aicbic_tables$AIC
Method AIC for best model chose
df AIC
mod_prev 8 35153.66
mod_new 9 35154.46
aicbic_tables$BIC
Method BIC for best model chose
df BIC
mod_prev 8 35200.24
mod_new 9 35206.87

In line with the exploratory analysis, the interaction between maternal smoking and newborn length is retained, as it is statistically significant and improves model fit without introducing multicollinearity.

#Model with interaction between Fumatrici and Lunghezza
data$Lunghezza_c <- scale(data$Lunghezza, center = TRUE, scale = FALSE)

mod4 <- lm(Peso ~ Gestazione + Cranio + N.gravidanze + Fumatrici*Lunghezza_c + Sesso, data = data)

#Print of regression model coefficients of the fourth model
coef_fun(mod4)
Coefficient summary of initial regression model
Estimate Std. Error t value p_value Significance
(Intercept) -1556.86 191.92 -8.11 0.000 ***
Gestazione 32.98 3.81 8.66 0.000 ***
Cranio 10.54 0.43 24.73 0.000 ***
N.gravidanze 12.98 4.34 2.99 0.003 **
FumatriciFumatrici -23.13 27.76 -0.83 0.405
Lunghezza_c 10.14 0.30 33.42 0.000 ***
SessoF -76.95 11.21 -6.86 0.000 ***
FumatriciFumatrici:Lunghezza_c 2.98 1.28 2.33 0.020
#Print of AIC and BIC test
aicbic_tables <- aic_bic_fun(mod3, mod4)
aicbic_tables$AIC
Method AIC for best model chose
df AIC
mod_prev 9 35154.46
mod_new 9 35150.21
aicbic_tables$BIC
Method BIC for best model chose
df BIC
mod_prev 9 35206.87
mod_new 9 35202.62
#Print of VIF table
vif_fun(mod4)
Variance Inflation Factor (VIF)
Variable VIF
Gestazione Gestazione 1.68
Cranio Cranio 1.62
N.gravidanze N.gravidanze 1.03
Fumatrici Fumatrici 1.02
Lunghezza_c Lunghezza_c 2.12
Sesso Sesso 1.04
Fumatrici:Lunghezza_c Fumatrici:Lunghezza_c 1.05

Final model

A nonlinear effect of gestational age is introduced to explain the observed pattern of saturation in birth weight.

A slight improvement in AIC is observed, suggesting greater accuracy, while BIC, which further penalizes model complexity, deteriorates marginally. The nonlinear interaction is significant and clinically relevant, and therefore it is decided to retain it. The VIF values do not indicate multicollinearity among the included variables.

#Model with non linear term for Gestazione
mod5 <- lm(Peso ~ Gestazione_c + I(Gestazione_c^2) + Cranio + N.gravidanze + Fumatrici*Lunghezza_c + Sesso, data = data)

#Print of regression model coefficients of the fifth model
coef_fun(mod5)
Coefficient summary of initial regression model
Estimate Std. Error t value p_value Significance
(Intercept) -308.18 145.85 -2.11 0.035
Gestazione_c 37.26 4.30 8.66 0.000 ***
I(Gestazione_c^2) 1.41 0.66 2.13 0.033
Cranio 10.63 0.43 24.84 0.000 ***
N.gravidanze 13.04 4.34 3.00 0.003 **
FumatriciFumatrici -22.41 27.75 -0.81 0.419
Lunghezza_c 10.24 0.31 33.36 0.000 ***
SessoF -74.93 11.25 -6.66 0.000 ***
FumatriciFumatrici:Lunghezza_c 2.81 1.28 2.20 0.028
#Print of AIC and BIC test
aicbic_tables <- aic_bic_fun(mod4, mod5)
aicbic_tables$AIC
Method AIC for best model chose
df AIC
mod_prev 9 35150.21
mod_new 10 35147.65
aicbic_tables$BIC
Method BIC for best model chose
df BIC
mod_prev 9 35202.62
mod_new 10 35205.89
#Print of VIF table
vif_fun(mod5)
Variance Inflation Factor (VIF)
Variable VIF
Gestazione_c Gestazione_c 2.15
I(Gestazione_c^2) I(Gestazione_c^2) 1.83
Cranio Cranio 1.64
N.gravidanze N.gravidanze 1.03
Fumatrici Fumatrici 1.02
Lunghezza_c Lunghezza_c 2.17
Sesso Sesso 1.05
Fumatrici:Lunghezza_c Fumatrici:Lunghezza_c 1.05

Final model estimates interpretation

The model coefficients displayed are all statistically significant and show that:

  • Gestational age is associated with an average gain of 37 g for each additional week, measured relative to the sample mean, with a positive quadratic component indicating a slightly nonlinear growth trend for higher gestational ages.
  • Head circumference is associated with an average gain of 10.6 g per additional millimeter.
  • Length is associated with an average gain of 10.2 g per additional millimeter, measured relative to the sample mean.
  • The number of previous pregnancies is associated with an average gain of 13 g per single pregnancy.
  • Female sex is associated with a mean birth weight that is 75 g lower than males.
  • Maternal smoking is associated with a lower predicted birth weight, although the main effect is not statistically significant.
  • The interaction between smoking and length indicates a stronger positive association between length and weight in infants of mothers who smoke.

Best model evaluation

#Calculation of R^2 and RMSE of the best model
R_squared <- summary(mod5)$r.squared
R_squared <- round(R_squared, 2)

#Calculation of RMSE of the best model
predicted <- predict(mod5)
actual <- data$Peso
RMSE <- sqrt(mean((actual - predicted)^2))
RMSE <- round(RMSE, 2)

#Table of R^2 and RMSE of the best model
df_res <- data.frame(R_squared, RMSE)
knitr::kable(df_res, caption = "Table of R^2 and RMSE of the best model")
Table of R^2 and RMSE of the best model
R_squared RMSE
0.73 273.75

The model explains approximately 73% of the variability in birth weight and has a root mean square prediction error of about 274 grams, which is relatively small compared to the sample mean weight of approximately 3300 grams.

#Function for test assumption for model residuals
mod_diag <- function(mod) {
  
  bp_test <- lmtest::bptest(mod)
  dw_test <- lmtest::dwtest(mod)
  shapiro_test <- shapiro.test(residuals(mod))
  
  #Diagnostic test table
  diag_tab <- data.frame(
    Test = c("Breusch-Pagan (Heteroscedasticity)",
             "Durbin-Watson (Autocorrelation)",
             "Shapiro-Wilk (Normality of Residuals)"),
    Statistic = c(round(bp_test$statistic, 2),
                  round(dw_test$statistic, 2),
                  round(shapiro_test$statistic, 2)),
    p_value = c(format.pval(bp_test$p.value, digits = 3, eps = 1e-3),
                format.pval(dw_test$p.value, digits = 3, eps = 1e-3),
                format.pval(shapiro_test$p.value, digits = 3, eps = 1e-3)),
    stringsAsFactors = FALSE
  )
  
  kable_diag <- kable(diag_tab, caption = "Diagnostic Tests Summary")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left")
  
  return(kable_diag)
  
}

#Print of diagnosis results
mod_diag(mod5)
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 103.55 <0.001
DW Durbin-Watson (Autocorrelation) 1.95 0.123
W Shapiro-Wilk (Normality of Residuals) 0.97 <0.001

Diagnostic tests reveal significant heteroskedasticity, no residual autocorrelation, and residual deviation from normality, as determined by the Breusch-Pagan, Durbin-Watson, and Shapiro-Wilk tests, respectively. However, the large sample size likely reduces the impact of non-normality on inference.

#Plot of residuals vs fitted
df_resid <- data.frame(
  Fitted = fitted(mod5),
  Residuals = resid(mod5),
  StdResid = rstandard(mod5),
  Index = 1:length(resid(mod5))
)

#Residuals vs fitted
ggplot(df_resid, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha=0.5) +
  geom_hline(yintercept = 0, linetype="dashed") +
  geom_smooth(method = "loess", se=FALSE, color="red") +
  labs(title = "Residuals vs predicted values", x = "Predicted values", y = "Residuals") +
  theme_classic()

The plot of studentized residuals versus predicted values shows increasing prediction variability with higher expected weights, confirming heteroscedasticity.

#Q-Q plot
ggplot(df_resid, aes(sample = StdResid)) +
  stat_qq(color = "blue") + stat_qq_line(color = "red") +
  labs(title = "Q-Q plot of model student residuals",
          x = "Theoretical quantiles",
          y = "Sample quantiles") +
  theme_classic()

The Q-Q plot shows normal behavior for most values (especially the central ones) and highlights deviations from normality mainly in the tails, particularly for the higher weight values, supporting the observed heteroscedasticity.

Influential points analysis

To investigate residual heteroscedasticity, outliers, high leverage, and influential points are visualized in the plot showing the relationship between studentized residuals and predicted values of the target variable.

influential_fun <- function(df_resid, mod) {
  
  n <- length(df_resid$Residuals)
  p <- length(coefficients(mod))
  
  data_diag <- data.frame(
    Index = 1:n,
    Leverage = hatvalues(mod),
    RStudent = df_resid$StdResid,
    CooksDistance = cooks.distance(mod)
  )
  
  #Cutoff for influential points
  leverage_cutoff <- 2 * p / n
  rstudent_cutoff <- 2
  cook_cutoff <- 4 / (n - p)
  
  #Flag influential points
  data_diag <- data_diag %>%
    mutate(
      HighLeverage = Leverage > leverage_cutoff,
      Outlier = abs(RStudent) > rstudent_cutoff,
      Influential = CooksDistance > cook_cutoff
    )
  
  #Influential points table
  influential_points <- data_diag %>%
    filter(HighLeverage | Outlier | Influential) %>%
    arrange(desc(CooksDistance)) %>%
    dplyr::select(Index, Leverage, RStudent, CooksDistance, HighLeverage, Outlier, Influential)
  
  #Return influential points
  return(influential_points)
}

#Influential points
infl_return <- influential_fun(df_resid, mod5)
HighLeverage <- infl_return$HighLeverage
Outlier <- infl_return$Outlier
Influetial <- infl_return$Influential

data_with_flags <- data %>%
  mutate(Index = row_number()) %>%
  left_join(infl_return, by = "Index")

df_resid_full <- data.frame(
  Index = 1:nrow(data_with_flags),
  Fitted = fitted(mod5),
  StdResid = rstandard(mod5)
) %>%
  left_join(infl_return, by = "Index")

#Function for influential points
plot_single_type <- function(data, flag_var, color, shape = 16, title = "") {
  ggplot(data, aes_string(x = "Fitted", y = "StdResid")) +
    geom_point(alpha = 0.3, color = "grey") +
    geom_point(data = filter(data, !!rlang::sym(flag_var)),
               aes(x = Fitted, y = StdResid),
               color = color, shape = shape, size = 3, alpha = 0.8) +
    geom_smooth(method = "loess", se = FALSE) +
    theme_classic() +
    labs(title = title, x = "Predicted values", y = "Std Residuals")
}

p1 <- plot_single_type(df_resid_full, "HighLeverage", "red", 16, "High Leverage Points")
p2 <- plot_single_type(df_resid_full, "Outlier", "blue", 1, "Outlier Points")
p3 <- plot_single_type(df_resid_full, "Influential", "black", 4, "Influential Points")

#Plot of influential
combined_plot <- p1 | p2 | p3  

print(combined_plot)

Influence points and outliers are mainly concentrated at higher predicted values and show biologically plausible values. This pattern appears to reflect the natural variability in infants with larger body sizes. Therefore, modeling approaches that account for non-constant variance, such as generalized linear models (GLMs) with appropriate link functions, such as the logarithmic link, could be considered.

Model predictions

This section tests the model output by simulating the prediction result for selected values of the regressors.

#Prediction with some example values
predict_weight <- function(N_gravidanze, Fumatrici, Gestazione, Lunghezza, Cranio, Sesso, model, mean_gest, mean_lung) {
  
  #Control of factorial levels
  if(!Fumatrici %in% levels(model$model$Fumatrici)) {
    stop("Invalid value for Fumatrici")
  }
  if(!Sesso %in% unique(model$model$Sesso)) {
    stop("Invalid value for Sesso")
  }
  
  #Centering of variable
  Gestazione_c <- as.double(Gestazione - mean_gest)
  Gestazione_c2 <- as.double(Gestazione_c^2)
  Lunghezza_c <- as.double(Lunghezza - mean_lung)
  
  #Creation of matrix 1x1 as the same class of variables in model
  Gestazione_c <- matrix(Gestazione_c, nrow = 1, ncol = 1)
  Gestazione_c2 <- matrix(Gestazione_c2, nrow = 1, ncol = 1)
  Lunghezza_c <- matrix(Lunghezza_c, nrow = 1, ncol = 1)
  
  #New data
  newdata <- data.frame(
    N.gravidanze = as.numeric(N_gravidanze),
    Fumatrici = factor(Fumatrici, levels = levels(model$model$Fumatrici)),
    Cranio = as.numeric(Cranio),
    Sesso = factor(Sesso, levels = levels(model$model$Sesso))
  )
  
  #Add matrix for variables Gestazione_c, Gestazione_c2 and Lunghezza_c
  newdata$Gestazione_c <- I(Gestazione_c)
  newdata$Gestazione_c2 <- I(Gestazione_c2)
  newdata$Lunghezza_c <- I(Lunghezza_c)
  
  #Logaritmic prediction
  w_pred <- predict(model, newdata = newdata)
  
  #Return of transformed prediction
  weight_pred <- w_pred
  
  return(weight_pred)
}

#Mean calculation out of the function for centering variables
mean_gest <- mean(data$Gestazione, na.rm = TRUE)
mean_lung <- mean(data$Lunghezza, na.rm = TRUE)

predicted_weight <- predict_weight(
  N_gravidanze = 3,
  Fumatrici = "NonFumatrici",
  Gestazione = 39,
  Lunghezza = 490, 
  Cranio = 337,
  Sesso = "F",
  model = mod5,
  mean_gest = mean_gest,
  mean_lung = mean_lung
)

In this case, this is the result of prediction. Predicted birth weight: 3190 grams

Model results

To convey the most relevant model findings, the graphical representations focus on the clinically most influential variables for predicting birth weight and on the statistically significant interaction terms retained in the final model.

#Function to plot relationship with regressors and target variable
create_plots <- function(predict_func, model, mean_gest, mean_lung, fixed_vals_list) {
  
  #Create levels for categorical variables
  fumatrici_levels <- levels(model$model$Fumatrici)
  sesso_levels <- levels(model$model$Sesso)
  
  #Plot 1: Peso vs Gestazione and Fumatrici
  gest_seq <- seq(25, 45, 0.5)
  
  df1 <- expand.grid(Gestazione = gest_seq, Fumatrici = fumatrici_levels)
  df1$Predicted <- mapply(function(g,f) {
    predict_func(N_gravidanze = fixed_vals_list$p1$N_gravidanze,
                 Fumatrici = f,
                 Gestazione = g,
                 Lunghezza = fixed_vals_list$p1$Lunghezza,
                 Cranio = fixed_vals_list$p1$Cranio,
                 Sesso = fixed_vals_list$p1$Sesso,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df1$Gestazione, df1$Fumatrici)
  
  p1 <- ggplot(df1, aes(x=Gestazione, y=Predicted, color=Fumatrici)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs Gestazione and Fumatrici") +
    theme_classic()
  
  #Plot 2: Peso vs Lunghezza and Fumatrici
  lung_seq <- seq(300, 560, 0.5)
  df2 <- expand.grid(Lunghezza = lung_seq, Fumatrici = fumatrici_levels)
  df2$Predicted <- mapply(function(l,f) {
    pred <- predict_func(N_gravidanze = fixed_vals_list$p2$N_gravidanze,
                 Fumatrici = f,
                 Gestazione = fixed_vals_list$p2$Gestazione,
                 Lunghezza = l,
                 Cranio = fixed_vals_list$p2$Cranio,
                 Sesso = fixed_vals_list$p2$Sesso,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df2$Lunghezza, df2$Fumatrici)
  
  p2 <- ggplot(df2, aes(x=Lunghezza, y=Predicted, color=Fumatrici)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred.Weight vs Lunghezza and Fumatrici") +
    theme_classic()
  
  return(list(p1, p2))
}

#Creation of fixed vals list
fixed_vals_list <- list(
  p1 = list(N_gravidanze=3, Lunghezza=490, Cranio=337, Sesso="F"),
  p2 = list(N_gravidanze=3, Gestazione=39, Cranio=337, Sesso="F")
)

#Execution and visualization
plots <- create_plots(predict_weight, mod5, mean_gest, mean_lung, fixed_vals_list)

#Creation of wrap plots
wrap_plots(plots, ncol=2, nrow=1)

A nonlinear relationship between expected weight and gestational age is confirmed, with greater gain in the final weeks and lower weight for newborns of mothers who smoke. Weight and length are positively correlated, with differences observed between smoking groups.

These findings support the biological importance of gestation and anthropometric measurements and the impact of smoking on neonatal growth.

Conclusion

The final model provides a reliable estimate of neonatal birth weight and highlights gestational age, maternal smoking, sex, and anthropometric measurements as determining factors. Predictive performance is generally satisfactory. Greater uncertainty is observed at extreme weight values, suggesting that model estimates should complement rather than replace clinical judgment.

Operationally, the model can support the identification of high-risk cases, such as preterm births, and resource planning to manage such cases. Future developments could include the integration of additional fetal growth data and additional maternal variables to improve personalization and predictive robustness.