Project introduction

The project aims to develop a statistical model to accurately predict newborn birth weight using clinical data from three hospitals. This initiative seeks to improve the management of high-risk pregnancies, optimize hospital resources, and promote better neonatal outcomes, within a context of growing attention to the prevention of neonatal complications. Predicting neonatal birth weight is essential for optimal care planning and reducing risks associated with preterm births or low birth weight infants.

The main benefits of the project are:

#Load dataset
data <- read.csv("C:/Users/matmi/OneDrive/Desktop/ProgettiR/neonati.csv")

#First 10 rows of dataset
kable(head(data, 10), caption = "First 10 rows of dataset")
First 10 rows of dataset
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
26 0 0 42 3380 490 325 Nat osp3 M
21 2 0 39 3150 490 345 Nat osp1 F
34 3 0 38 3640 500 375 Nat osp2 M
28 1 0 41 3690 515 365 Nat osp2 M
20 0 0 38 3700 480 335 Nat osp3 F
32 0 0 40 3200 495 340 Nat osp2 F
26 1 0 39 3100 480 345 Nat osp3 F
25 0 0 40 3580 510 349 Nat osp1 M
22 1 0 40 3670 500 335 Ces osp2 F
23 0 0 41 3700 510 362 Ces osp2 F
#Summary of dataset
kable(summary(data), caption = "Statistical index summary of dataset")
Statistical index summary of dataset
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Length:2500 Length:2500 Length:2500
1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Class :character Class :character Class :character
Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00 Median :3300 Median :500.0 Median :340 Mode :character Mode :character Mode :character
Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA
attach(data)

To build the predictive model, data were collected on 2,500 newborns from three hospitals. The collected variables include:

The dataset shows the following key characteristics:

Descriptive analysis of dataset

The dataset analysis is structured into the following main phases:

Statistical index calculation

#Function for statistical index calculation of each variable
calc_stats <- function(x) {
  c(
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    skewness = moments::skewness(x, na.rm = TRUE),
    kurtosis = moments::kurtosis(x, na.rm = TRUE)
  )
}

#Matrix of statistical index for each numeric variable of dataset
numeric_vars <- sapply(data, is.numeric)
numeric_vars <- numeric_vars & sapply(data, function(x) length(unique(na.omit(x))) > 2)
stat_summary <- sapply(data[, numeric_vars], calc_stats)
kable(stat_summary, caption = "Matrix of statistical index for each numeric variable of dataset")
Matrix of statistical index for each numeric variable of dataset
Anni.madre N.gravidanze Gestazione Peso Lunghezza Cranio
mean 28.1640000 0.981200 38.980400 3284.0808000 494.692000 340.0292000
median 28.0000000 1.000000 39.000000 3300.0000000 500.000000 340.0000000
sd 5.2735783 1.280587 1.868639 525.0387442 26.318644 16.4253299
skewness 0.0428115 2.514254 -2.065313 -0.6470308 -1.514699 -0.7850527
kurtosis 3.3804165 13.989406 11.258150 5.0315318 9.487174 5.9462063

The calculated statistics show the following situation for each variable:

  • Mother’s age: Distribution close to normal, with a slight heavy tail; it can be initially considered normal.
  • Number of pregnancies: Non-normal distribution, strongly skewed with a high peak; far from normal.
  • Gestation duration: Non-normal distribution, left-skewed with heavy tails; cannot be considered normal.
  • Birth weight: Approximately normal distribution but with negative skewness and heavy tails; moderately far from normal.
  • Length: Non-normal distribution, skewed towards lower values with heavy tails; far from normal.
  • Head circumference: Slightly skewed distribution with moderate tails; can be considered approximately normal with caution.

Shapiro-Wilk test for numerical variables

This phase aims to assess whether the distributions of the numerical variables in the dataset approximate a normal distribution, a fundamental prerequisite for the use of many parametric statistical techniques, such as the t-test or linear regression. To this end, statistical tests (Shapiro-Wilk) and visual methods, such as density plots compared to the theoretical normal distribution and Q-Q plots (quantile-quantile), were employed. These allow identification of any asymmetries, heavy tails, or deviations from normality.

The results of this analysis will guide the selection of the most appropriate statistical tools and models in the subsequent phases of the project.

#Function for Shapiro-Wilk test
shapiro_summary <- function(x, alpha = 0.05) {
  x <- na.omit(x)
 
  if (length(x) < 3) {
    return(NA)
  }
  if (length(x) > 5000) {
    x <- sample(x, 5000)
  }
  test <- shapiro.test(x)
  p_value <- test$p.value
  result <- ifelse(p_value > alpha, "Fail to reject H0 (normality)", "Reject H0 (not normal)")
  
  return(c(p_value = p_value, conclusion = result))
}

shapiro_results <- lapply(data[, numeric_vars], shapiro_summary)

shapiro_df <- do.call(rbind, lapply(shapiro_results, function(x) {
  data.frame(p_value = x["p_value"], conclusion = x["conclusion"])
}))

rownames(shapiro_df) <- names(shapiro_results)

kable(shapiro_df, caption = "Results of Shapiro-Wilk test on each variable of dataset", row.names = TRUE)
Results of Shapiro-Wilk test on each variable of dataset
p_value conclusion
Anni.madre 1.63894746545117e-09 Reject H0 (not normal)
N.gravidanze 8.89524555042765e-54 Reject H0 (not normal)
Gestazione 2.6437822902771e-45 Reject H0 (not normal)
Peso 3.2344586185273e-22 Reject H0 (not normal)
Lunghezza 3.18605170830064e-36 Reject H0 (not normal)
Cranio 1.26291692019115e-24 Reject H0 (not normal)

The Shapiro-Wilk test indicates that no variable is perfectly normal, but in large samples even small deviations lead to rejecting the normality hypothesis. However, in large samples, parametric tests are generally robust to moderate departures from normality, especially when supported by graphical analyses (to be conducted in the following phases) and by considering the practical relevance of the deviations observed. This strategy allows making the most of the effectiveness and sensitivity of parametric methods to draw solid conclusions, while remaining attentive to potential limitations caused by more pronounced deviations.

Numerical variable distribution visual analysis

#Function for density plot
plot_density <- function(x, main_title = deparse(substitute(x))) {
  x <- na.omit(x)
  
  m <- mean(x)
  s <- sd(x)
  
  df <- data.frame(value = x)
  
  x_seq <- seq(min(x), max(x), length.out = 1000)
  norm_density <- dnorm(x_seq, mean = m, sd = s)
  
  norm_df <- data.frame(x = x_seq, y = norm_density)
  
  ggplot(df, aes(x = value)) +
    geom_density(color = "black", fill = "gray80") +
    geom_line(data = norm_df, aes(x = x, y = y), color = "red", size = 1) +
    labs(title = paste("Density plot of", main_title),
         x = "Value",
         y = "Density") +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5))
}

#Variables for plot
vars_to_plot <- data[, numeric_vars]

plots_list <- lapply(names(vars_to_plot), function(var_name) {
  plot_density(vars_to_plot[[var_name]], main_title = var_name)
})

#Plot of each variable distribution vs normal distribution with same mean and standard deviation
combined_plot <- wrap_plots(plots_list, ncol = 2, nrow = 3)
print(combined_plot)

The visual analysis of the density plots confirms the main characteristics highlighted by the statistical indicators:

  • Maternal age and birth weight show approximately symmetric and near-normal distributions.
  • Number of pregnancies has a highly skewed distribution toward low values.
  • Gestation duration presents a slight skewness toward lower weeks, consistent with the presence of prematurity.
  • Length and head circumference are largely symmetric, with only a slight skewness in length.

Numerical variable Q-Q plot analysis

#Function for Q-Q plot
qq_plot <- function(x, main_title = deparse(substitute(x))) {
  x <- na.omit(x)
  
  df <- data.frame(value = x)
  
  ggplot(df, aes(sample = value)) +
    stat_qq(color = "blue")+
    stat_qq_line(color = "red")+
    labs(title = paste("Q-Q plot of", main_title),
          x = "Theoretical quantiles",
          y = "Sample quantiles") +
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))
}

qq_list <- lapply(names(vars_to_plot), function(var_name) {
  qq_plot(vars_to_plot[[var_name]], main_title = var_name)
})

combined_plot <- wrap_plots(qq_list, ncol = 2, nrow = 3)
print(combined_plot)

The results of the Q-Q plots, which display the quantile-quantile distribution of the variables compared to theoretical normality, confirm the previous observations regarding the closeness of some variables to normality and the more marked deviations of others, especially in the tails and asymmetries.

Numerical variable outlier analysis

In this phase, an outlier analysis was performed using the interquartile range (IQR) rule to identify anomalous values that could affect the quality of the analysis. Clinically relevant data, such as cases of prematurity or multiple pregnancies, were retained, while only values deemed implausible were removed to ensure the reliability and representativeness of the sample.

#Function for boxplot
box_plot <- function(x, main_title = deparse(substitute(x))) {
  x <- na.omit(x)
  
  df <- data.frame(value = x)
  
  ggplot(df, aes(x = "", y = value)) +
    geom_boxplot(fill = "lightblue", color = "darkblue") +
    labs(title = paste("Boxplot of", main_title),
         y = main_title,
         x = "") +
    theme_classic() +
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          plot.title = element_text(hjust = 0.5))
}

box_list <- lapply(names(vars_to_plot), function(var_name) {
  box_plot(vars_to_plot[[var_name]], main_title = var_name)
})

combined_plot <- wrap_plots(box_list, ncol = 3, nrow = 2)
print(combined_plot)

#Function to calculate outliers
calc_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  which_outliers <- which(x < lower_bound | x > upper_bound)
  
  if(length(which_outliers) == 0) {
    return(NULL)
  }
  
  n_outliers <- length(which_outliers)  
  total <- sum(!is.na(x))
  percent_outliers <- 100 * n_outliers / total 
  
  conclusion <- ifelse(percent_outliers > 5, "High number of outliers", "Acceptable number of outliers")
  
  return(list(n_outliers = n_outliers, percent_outliers = percent_outliers, conclusion = conclusion))
}

outliers_list <- lapply(data[, numeric_vars], calc_outliers)
outliers_df <- bind_rows(lapply(outliers_list, as.data.frame), .id = "Variable")

kable(outliers_df, caption = "Table of number, percentage and conclusion for outliers of each numerical variable")
Table of number, percentage and conclusion for outliers of each numerical variable
Variable n_outliers percent_outliers conclusion
Anni.madre 13 0.52 Acceptable number of outliers
N.gravidanze 246 9.84 High number of outliers
Gestazione 67 2.68 Acceptable number of outliers
Peso 69 2.76 Acceptable number of outliers
Lunghezza 59 2.36 Acceptable number of outliers
Cranio 48 1.92 Acceptable number of outliers

The outlier analysis shows that the number of pregnancies has a significant proportion of extreme values (9.84%) that could influence the model, while mother’s age has few outliers likely due to errors that need correction. The other variables exhibit few anomalies, mainly attributable to clinically valid data (e.g., prematurity), which should be retained. Overall, the data are of good quality and suitable for modeling, with particular attention to managing outliers in “Number of pregnancies” and “Mother’s age.”

#Function to find and remove error in variable Anni.Madre
remove_rows_condition <- function(data, var, condition_func) {
  rows_to_remove <- which(condition_func(data[[var]]))
  data_filtered <- data[-rows_to_remove, ]
  return(data_filtered)
}

data <- remove_rows_condition(data, "Anni.madre", function(x) x < 10)

#Summary of dataset after removing error in variable Anni.Madre
kable(summary(data), caption = "Summary of dataset after removing error in variable Anni.Madre")
Summary of dataset after removing error in variable Anni.Madre
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. :13.00 Min. : 0.0000 Min. :0.00000 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Length:2498 Length:2498 Length:2498
1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Class :character Class :character Class :character
Median :28.00 Median : 1.0000 Median :0.00000 Median :39.00 Median :3300 Median :500.0 Median :340 Mode :character Mode :character Mode :character
Mean :28.19 Mean : 0.9816 Mean :0.04163 Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.00000 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 Max. :1.00000 Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

An exploratory analysis was conducted by identifying and removing evident errors, such as maternal ages below biologically plausible values, eliminating only a few observations without compromising the overall quality of the dataset. Outliers related to clinical variables such as weight, length, head circumference, and gestation duration were deemed plausible and therefore retained, as were the rare high values in the “N.gravidanze” variable. However, considering the potential influence of these extreme values on the model, it might be useful to further investigate the analysis by applying robust statistical methods or variable transformations in the future, while carefully assessing whether these approaches lead to meaningful improvements without overly complicating the model’s interpretability and parsimony.

Multivariate analysis of dataset and statistical tests

Below, a multivariate analysis of the dataset variables is conducted. This analysis aims to examine the distributions of numerical variables in relation to the corresponding levels of categorical variables, in order to identify differences between groups, through:

  • Calculation of the means of each numerical variable for all groups, along with their differences.
  • Statistical tests and boxplot visualizations to determine, for each numerical variable, whether the distributions differ significantly among groups.
  • Focus on specific considerations, such as: assessing if there is a difference in the number of cesarean deliveries among hospitals, verifying whether the mean birth weight and length significantly differ from those of the general population, and checking if anthropometric measurements differ significantly between sexes.

Mean calculation beetwen groups and relative differences

#Function to calculate mean aggregated by other variables
calc_aggregate <- function(data, num_var, cat_var) {
  agg <- data %>%
    group_by(across(all_of(cat_var))) %>%
    summarise(
      mean = mean(.data[[num_var]], na.rm = TRUE),
      .groups = "drop"
    )%>%
    pivot_longer(cols = mean, names_to = "Stat", values_to = "Value") %>%
    pivot_wider(names_from = !!sym(cat_var), values_from = Value) %>%
    mutate(Variable = num_var, .before = 1)
  
  #Extract only mean
  means_row <- agg %>% dplyr::select(-Variable, -Stat)
  
  group_names <- colnames(means_row)
  means_values <- as.numeric(means_row[1, ])
  
  #Function to calculate mean differences between pairs of values for each categorical variable
  comb_pairs <- combn(seq_along(means_values), 2, simplify = FALSE)
  
  diff_cols <- map(comb_pairs, function(idx) {
      i <- idx[1]
      j <- idx[2]
      diff_val <- means_values[j] - means_values[i]
      col_name <- paste0("diff_", group_names[j], "_minus_", group_names[i])
      
      diff_col <- rep(NA_real_, nrow(agg))
      media_row_index <- which(agg$Stat == "mean")
      diff_col[media_row_index] <- diff_val
    
      list(name = col_name, values = diff_col)
  })
  
  for (c in diff_cols) {
    agg[[c$name]] <- c$values
  }
  
  return(agg)
}

#Definition of categorical variables
categorical_vars <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")
numerical_vars <- names(data)[numeric_vars]

results_list <- list()

#For statement to save all dataset (aggregated index by each categorical variable)
for (cat_var in categorical_vars) {
  agg_tables <- lapply(numerical_vars, function(num_var) {
    df_agg <- calc_aggregate(data, num_var, cat_var)
  })
  results_list[[cat_var]] <- bind_rows(agg_tables)
}

#Print of each calculated table
kable(results_list[["Fumatrici"]], caption = "Aggregated stats by Fumatrici")
Aggregated stats by Fumatrici
Variable Stat 0 1 diff_1_minus_0
Anni.madre mean 28.1804511 28.317308 0.1368566
N.gravidanze mean 0.9690894 1.269231 0.3001414
Gestazione mean 38.9670008 39.269231 0.3022299
Peso mean 3286.2623225 3236.346154 -49.9161686
Lunghezza mean 494.8099415 492.067308 -2.7426338
Cranio mean 340.0588972 339.346154 -0.7127434
kable(results_list[["Tipo.parto"]], caption = "Aggregated stats by Tipo.parto")
Aggregated stats by Tipo.parto
Variable Stat Ces Nat diff_Nat_minus_Ces
Anni.madre mean 28.201923 28.1796610 -0.0222621
N.gravidanze mean 1.023352 0.9644068 -0.0589449
Gestazione mean 39.023352 38.9615819 -0.0617697
Peso mean 3282.046703 3285.0632768 3.0165735
Lunghezza mean 496.365385 494.0090395 -2.3563451
Cranio mean 340.012363 340.0361582 0.0237956
kable(results_list[["Ospedale"]], caption = "Aggregated stats by Ospedale")
Aggregated stats by Ospedale
Variable Stat osp1 osp2 osp3 diff_osp2_minus_osp1 diff_osp3_minus_osp1 diff_osp3_minus_osp2
Anni.madre mean 28.045343 28.1037736 28.407674 0.0584304 0.3623307 0.3039003
N.gravidanze mean 0.995098 0.9245283 1.026379 -0.0705697 0.0312809 0.1018506
Gestazione mean 38.944853 38.9693396 39.023981 0.0244867 0.0791279 0.0546412
Peso mean 3270.265931 3270.5070755 3311.708633 0.2411441 41.4427017 41.2015576
Lunghezza mean 494.120098 495.3396226 494.604316 1.2195246 0.4842185 -0.7353061
Cranio mean 339.927696 339.8856132 340.274580 -0.0420829 0.3468843 0.3889671
kable(results_list[["Sesso"]], caption = "Aggregated stats by Sesso")
Aggregated stats by Sesso
Variable Stat F M diff_M_minus_F
Anni.madre mean 28.1370518 28.235720 0.0986682
N.gravidanze mean 0.9498008 1.013677 0.0638758
Gestazione mean 38.7314741 39.230089 0.4986144
Peso mean 3161.0613546 3408.495575 247.4342206
Lunghezza mean 489.7641434 499.674980 9.9108365
Cranio mean 337.6231076 342.458568 4.8354604

Statistical tests to investigate differences beetwen groups (focus on Peso and Lunghezza beetwen Sesso)

#Function to explore differences beetwen groups for each numerical variable
data$Fumatrici <- factor(data$Fumatrici, levels = c(0, 1), labels = c("NonFumatrici", "Fumatrici"))
categorical_for_test <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")

test_group_diff <- function(data, num_var, cat_var) {
  x <- data[[num_var]]
  g <- data[[cat_var]]
  
  df <- data.frame(x = x, g = g) %>% filter(!is.na(x), !is.na(g))
  
  n_groups <- length(unique(df$g))
  
  #Normality test for each group
  normality <- by(df$x, df$g, function(v) shapiro.test(v)$p.value)
  normal <- all(unlist(normality) > 0.05)
  
  #Omogeneity variance test (Levene)
  levene_p <- car::leveneTest(x ~ g, data = df)$"Pr(>F)"[1]
  variance_homogeneous <- (levene_p > 0.05)
  
  if (n_groups == 2) {
    if (normal && variance_homogeneous) {
      test <- t.test(x ~ g, data = df, var.equal = TRUE)
      test_name <- "t-test (equal variance)"
      nparam_test_name <- NA_character_
      nparam_pvalue <- NA_real_
    } else {
      test <- t.test(x ~ g, data = df, var.equal = FALSE)
      test_name <- "Welch t-test (different variance)"
      nparam_test_name <- "Wilcoxon rank-sum test"
      wtest <- wilcox.test(x ~ g, data = df)
      nparam_pvalue <- wtest$p.value
    }
  } else if (n_groups > 2) {
    if (normal && variance_homogeneous) {
      test <- aov(x ~ g, data = df)
      test_name <- "ANOVA (equal variance)"
      nparam_test_name <- NA_character_
      nparam_pvalue <- NA_real_
    } else {
      test <- oneway.test(x ~ g, data = df, var.equal = FALSE)
      test_name <- "Welch ANOVA (different variance)"
      nparam_test_name <- "Kruskal-Wallis test"
      kwtest <- kruskal.test(x ~ g, data = df)
      nparam_pvalue <- kwtest$p.value
    }
  } else {
    return(NA)
  }
  
  if (test_name == "ANOVA (equal variance)") {
    param_pvalue <- summary(test)[[1]][["Pr(>F)"]][1]
  } else {
    param_pvalue <- test$p.value
  }
  
  return(list(
    test = test_name,
    nparam_test = nparam_test_name,
    param_pvalue = param_pvalue,
    nparam_pvalue = nparam_pvalue
  ))
}

#Return of test results
results_tests <- expand.grid(num_var = numerical_vars, cat_var = categorical_vars, stringsAsFactors = FALSE)

results_tests <- results_tests %>%
  rowwise() %>%
  mutate(test_result = list(test_group_diff(data, num_var, cat_var))) %>%
  mutate(
  test = test_result$test,
  param_pvalue = test_result$param_pvalue,
  param_significance = case_when(
    param_pvalue < 0.001 ~ "***",
    param_pvalue < 0.01  ~ "**",
    param_pvalue < 0.05  ~ "*",
    TRUE                 ~ ""
  ),
  nparam_test = test_result$nparam_test,
  nparam_pvalue = test_result$nparam_pvalue,
  nparam_significance = case_when(
    nparam_pvalue < 0.001 ~ "***",
    nparam_pvalue < 0.01  ~ "**",
    nparam_pvalue < 0.05  ~ "*",
    TRUE                  ~ ""
    ),
  p_value_concordance = ifelse((param_pvalue < 0.05 & nparam_pvalue < 0.05) |
         (param_pvalue >= 0.05 & nparam_pvalue >= 0.05), "P-value concordance", "P-value discordance")
  ) %>%
  ungroup() %>%
  dplyr::select(num_var, cat_var, test, param_pvalue, param_significance, nparam_test, nparam_pvalue, nparam_significance, p_value_concordance)

#Print of results
kable(
  results_tests,
  caption = "Statistical test results for differences beetwen groups",
  col.names = c("Numerical variable", "Categorical variable", "Parametric test", "P-value pametric test", "Significance parametric test", "Not parametric test", "P-value not parametric test", "Significance not parametric test", "P-Value concordance")
)
Statistical test results for differences beetwen groups
Numerical variable Categorical variable Parametric test P-value pametric test Significance parametric test Not parametric test P-value not parametric test Significance not parametric test P-Value concordance
Anni.madre Fumatrici Welch t-test (different variance) 0.8016453 Wilcoxon rank-sum test 0.8716883 P-value concordance
N.gravidanze Fumatrici Welch t-test (different variance) 0.0188038 * Wilcoxon rank-sum test 0.0022427 ** P-value concordance
Gestazione Fumatrici Welch t-test (different variance) 0.0388965 * Wilcoxon rank-sum test 0.2098755 P-value discordance
Peso Fumatrici Welch t-test (different variance) 0.3022785 Wilcoxon rank-sum test 0.0592839 P-value concordance
Lunghezza Fumatrici Welch t-test (different variance) 0.2100567 Wilcoxon rank-sum test 0.0373755 * P-value discordance
Cranio Fumatrici Welch t-test (different variance) 0.6149153 Wilcoxon rank-sum test 0.3392392 P-value concordance
Anni.madre Tipo.parto Welch t-test (different variance) 0.9240376 Wilcoxon rank-sum test 0.9481223 P-value concordance
N.gravidanze Tipo.parto Welch t-test (different variance) 0.2881163 Wilcoxon rank-sum test 0.1196188 P-value concordance
Gestazione Tipo.parto Welch t-test (different variance) 0.4343654 Wilcoxon rank-sum test 0.9322109 P-value concordance
Peso Tipo.parto Welch t-test (different variance) 0.8916349 Wilcoxon rank-sum test 0.5243601 P-value concordance
Lunghezza Tipo.parto Welch t-test (different variance) 0.0326031 * Wilcoxon rank-sum test 0.0879934 P-value discordance
Cranio Tipo.parto Welch t-test (different variance) 0.9728416 Wilcoxon rank-sum test 0.6432892 P-value concordance
Anni.madre Ospedale Welch ANOVA (different variance) 0.3091713 Kruskal-Wallis test 0.2952401 P-value concordance
N.gravidanze Ospedale Welch ANOVA (different variance) 0.2472083 Kruskal-Wallis test 0.0661108 P-value concordance
Gestazione Ospedale Welch ANOVA (different variance) 0.6924804 Kruskal-Wallis test 0.2977338 P-value concordance
Peso Ospedale Welch ANOVA (different variance) 0.1749530 Kruskal-Wallis test 0.2164421 P-value concordance
Lunghezza Ospedale Welch ANOVA (different variance) 0.6356001 Kruskal-Wallis test 0.7351307 P-value concordance
Cranio Ospedale Welch ANOVA (different variance) 0.8647781 Kruskal-Wallis test 0.9283076 P-value concordance
Anni.madre Sesso Welch t-test (different variance) 0.6366427 Wilcoxon rank-sum test 0.7286129 P-value concordance
N.gravidanze Sesso Welch t-test (different variance) 0.2129632 Wilcoxon rank-sum test 0.2907222 P-value concordance
Gestazione Sesso Welch t-test (different variance) 0.0000000 *** Wilcoxon rank-sum test 0.0000000 *** P-value concordance
Peso Sesso Welch t-test (different variance) 0.0000000 *** Wilcoxon rank-sum test 0.0000000 *** P-value concordance
Lunghezza Sesso Welch t-test (different variance) 0.0000000 *** Wilcoxon rank-sum test 0.0000000 *** P-value concordance
Cranio Sesso Welch t-test (different variance) 0.0000000 *** Wilcoxon rank-sum test 0.0000000 *** P-value concordance

The objective of the analysis is to identify significant differences between groups in numerical variables by selecting appropriate statistical tests based on the data distributions. Although normality tests indicate deviations, many variables show approximately normal distributions with slight skewness and outliers. For this reason, standard parametric tests are used when normality and homogeneity of variances assumptions are met, supplemented by robust tests (Welch) in cases of heteroscedasticity, and by non-parametric tests (Wilcoxon and Kruskal-Wallis) when parametric assumptions are violated.

The combined application of these methods allows for reliable detection of significant differences between groups, such as the effect of maternal smoking on the number of pregnancies and newborn size, and differences related to sex on gestational duration and anthropometric characteristics.

Specifically, as explicitly requested, it is verified that anthropometric measurements differ significantly between the two sexes. This finding is confirmed by both parametric and non-parametric tests.

These results will be further illustrated in the following boxplots.

#Function for boxplot
box_plot_grouped <- function(data, num_var, group_var, main_title = NULL) {
  df <- data %>%
    dplyr::select(all_of(c(num_var, group_var))) %>%
    filter(!is.na(.data[[num_var]]), !is.na(.data[[group_var]]))
  
  if (is.null(main_title)) {
    main_title <- paste("Boxplot of", num_var, "by", group_var)
  }
  
  ggplot(df, aes_string(x = group_var, y = num_var, fill = group_var)) +
    geom_boxplot() +
    labs(title = main_title, x = group_var, y = num_var) +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5))
}

categorical_vars <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")

#Boxplot by group of categorical variables
plots_by_group <- lapply(categorical_vars, function(cat_var) {
  p_list <- lapply(numerical_vars, function(num_var) {
    box_plot_grouped(data, num_var, cat_var)
  })
  wrap_plots(p_list, ncol = 2) 
})

names(plots_by_group) <- categorical_vars

#Plot of each boxplot
print(plots_by_group[["Fumatrici"]])

print(plots_by_group[["Tipo.parto"]])

print(plots_by_group[["Ospedale"]])

print(plots_by_group[["Sesso"]])

The visual analysis through boxplots supports the statistical results, highlighting significant differences in gestational weeks and newborn length between smokers and non-smokers, as well as pronounced differences by newborn sex in gestational age, weight, length, and head circumference. No significant differences emerge for delivery type or hospital.

Statistical test for differences in cesarean deliveries among hospitals

As a further analysis, a chi-squared test of independence is conducted to verify whether the distribution of delivery type (natural or cesarean) significantly differs among the three hospitals in the dataset. This test allows the evaluation of whether an association exists between the healthcare facility and the mode of delivery.

calc_tab_cont <- function(var1, var2) {
  cont_tab <- table(var1, var2)
  return(cont_tab)
}

cont_tab <- calc_tab_cont(data$Tipo.parto, data$Ospedale)

kable(cont_tab, caption = "Contingency table for variables Tipo.parto and Ospedale")
Contingency table for variables Tipo.parto and Ospedale
osp1 osp2 osp3
Ces 242 254 232
Nat 574 594 602
result_t1 <- chisq.test(cont_tab)
test_ces_df <- data.frame(
  Method = result_t1$method,
  Statistic = round(as.numeric(result_t1$statistic), 4),
  P_Value = round(result_t1$p.value, 4)
)

kable(test_ces_df, caption = "Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale")
Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale
Method Statistic P_Value
Pearson’s Chi-squared test 1.083 0.5819

The statistical analysis using the chi-squared test on the contingency table between ‘Tipo.parto’ and ‘Ospedale’ revealed a p-value of 0.578, which is above the conventional significance threshold of 0.05. This indicates that there are no significant differences in the distribution of delivery types (natural or cesarean) among the three hospitals considered. Therefore, in the studied sample, the place of delivery does not appear to influence the frequency of cesarean births.

Statistical test for differences of Peso and Lunghezza from population

Moreover, a statistical verification is conducted to assess the difference between the sample means of newborn weight and length and their respective reference values (3300 grams for weight and 500 mm for length). These reference values are taken from the 2024 Cedap report of the Italian Ministry of Health and are derived from the data collected in the homonymous information flows that each Health Organization is required to produce monthly. To this end, two one-sample tests are employed: the Student’s t-test, based on the assumption of normality, and the Wilcoxon test, which is non-parametric and more robust to deviations from the normality assumption.

#Function to test Peso and Lunghezza then same population variable
compare_to_population <- function(x, mu, alpha = 0.05) {
  x <- na.omit(x)
  
  #T test
  t_res <- t.test(x, mu = mu)
  
  #Wilcoxon test
  wilcox_res <- wilcox.test(x, mu = mu)
  
  #Summary of results
  result <- data.frame(
      parametric_test = "One-sample t-test",
      p_value_ptest = t_res$p.value,
      conclusion_ptest = ifelse(t_res$p.value < alpha,
                          "Reject H0. Target variable is statistically different from population variable",
                          "Fail to Reject H0. Target variable is not statistically different from population variable"),
      nparametric_test = "Wilcoxon signed-rank test",
      p_value_nptest = wilcox_res$p.value,
      conclusion_nptest = ifelse(wilcox_res$p.value < alpha,
                          "Reject H0. Target variable is statistically different from population variable",
                          "Fail to Reject H0. Target variable is not statistically different from population variable"),
      stringsAsFactors = FALSE    
    )
  
    return(result)
 }

#Test for weight and length
weight_ref <- 3300
length_ref <- 500

weight_test <- compare_to_population(data$Peso, mu = weight_ref)
length_test <- compare_to_population(data$Lunghezza, mu = length_ref)

kable(weight_test, caption = "Summary table of results for compare beetwen Peso of sample and weight of population",
      col.names = c("Parametric test", "P-value parametric test", "Conclusion parametric test", "Not parametric test", "P-value not parametric test", "Conclusion not parametric test"))
Summary table of results for compare beetwen Peso of sample and weight of population
Parametric test P-value parametric test Conclusion parametric test Not parametric test P-value not parametric test Conclusion not parametric test
One-sample t-test 0.1324476 Fail to Reject H0. Target variable is not statistically different from population variable Wilcoxon signed-rank test 0.9482417 Fail to Reject H0. Target variable is not statistically different from population variable
kable(length_test, caption = "Summary table of results for compare beetwen Lunghezza of sample and length of population",
      col.names = c("Parametric test", "P-value parametric test", "Conclusion parametric test", "Not parametric test", "P-value not parametric test", "Conclusion not parametric test"))
Summary table of results for compare beetwen Lunghezza of sample and length of population
Parametric test P-value parametric test Conclusion parametric test Not parametric test P-value not parametric test Conclusion not parametric test
One-sample t-test 0 Reject H0. Target variable is statistically different from population variable Wilcoxon signed-rank test 0 Reject H0. Target variable is statistically different from population variable

Both parametric and non-parametric tests confirm that the average newborn weight in the sample does not differ significantly from the reference value of 3300 grams, indicating a good correspondence with the reference population. Conversely, for length, both tests detect a significant difference compared to the 500 mm reference, indicating that the average length in the sample deviates from the expected value. The consistency between the tests is supported by the approximate normality of the distributions, which justifies the use of parametric tests and strengthens the validity of the conclusions through the consistent results from the non-parametric tests.

Correlation analysis

In this phase, correlation analyses are performed among the dataset variables to identify any linear relationships between them. First, the global relationship among the numerical variables in the dataset is analyzed, followed by the relationship between numerical variables within each group defined by the values of the categorical variables.

#Select only numerical variables
num_data <- data[, numeric_vars]

#Correlation matrix for numerical variable of dataset
ggpairs(
  num_data,
  lower = list(continuous = wrap("smooth_loess", alpha = 0.3, size = 0.5)),
  upper = list(continuous = wrap("cor", size = 5)),
  diag = list(continuous = wrap("densityDiag"))
) +
  theme_classic() +
  ggtitle("Correlation matrix between numerical variable of dataset") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Function to plot correlation matrix among groups
plot_grouped_ggpairs <- function(data, cat_var, numeric_vars) {
  
  groups <- unique(data[[cat_var]])
  
  plots_list <- lapply(groups, function(g) {
    df_subset <- data[data[[cat_var]] == g, ]
    
    p <- ggpairs(
      df_subset,
      columns = which(names(data) %in% numeric_vars),
      lower = list(continuous = wrap("smooth_loess", alpha = 0.3)),
      upper = list(continuous = wrap("cor", size = 5)),
      diag = list(continuous = wrap("densityDiag"))
    ) + ggtitle(paste("Correlation matrix for", cat_var, "=", g))+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
    
    return(p)
  })
  
  names(plots_list) <- groups
  return(plots_list)
}

numeric_var_names <- names(data)[numeric_vars]

#Plot of all plot for each categorical variable
plots_all_categorical <- lapply(categorical_vars, function(cat_var) {
  plot_grouped_ggpairs(data, cat_var, numeric_var_names)
})

#Plot of correlation matrix for variable 'Fumatrici'
print(plots_all_categorical[[1]])

#Plot of correlation matrix for variable 'Tipo.parto'
print(plots_all_categorical[[2]])

#Plot of correlation matrix for variable 'Ospedale'
print(plots_all_categorical[[3]])

#Plot of correlation matrix for variable 'Sesso'
print(plots_all_categorical[[4]])

The descriptive and correlation analysis highlighted that variables related to gestation and neonatal anthropometric measurements are strongly associated with newborn weight and represent key potential predictors. Categorical variables such as maternal smoking, delivery type, newborn sex, and hospital facility influence these relationships and their inclusion in the predictive model should be evaluated.

In particular, the attenuated effects observed in the subgroup of smoking mothers suggest considering interaction terms or differential effects in the model to better capture the dynamics present in the sample. Furthermore, the strong correlation among some anthropometric measurements requires appropriate handling of multicollinearity.

Therefore, the multiple regression model to be developed should include both relevant continuous and categorical variables, assess possible interactions, and implement suitable strategies to address multicollinearity and outliers, in order to ensure reliable and interpretable estimates for the prediction of newborn weight.

Regression model creation

In this phase of the project, a multiple linear regression model is developed to quantify the effect of the available variables on newborn birth weight. To this end, a procedure has been implemented that systematically presents the main results of the model, including:

#Creation of regression model with backward approach

#Function for model results
mod_results <- function(mod) {
  
  summ <- summary(mod)
  vif_values <- car::vif(mod)
  bp_test <- lmtest::bptest(mod)
  dw_test <- lmtest::dwtest(mod)
  shapiro_test <- shapiro.test(residuals(mod))
  
  #Coefficients table with p-value
  coefs <- as.data.frame(summ$coefficients)
  coefs <- round(coefs, 4)
  rownames(coefs) <- rownames(summ$coefficients)
  colnames(coefs) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  
  print(kable(coefs, caption = "Coefficients of the Regression Model")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  #VIF table
  if (is.matrix(vif_values)) {
    vif_tab <- data.frame(
      Variable = rownames(vif_values),
      GVIF = round(vif_values[,"GVIF"], 3),
      GVIF_corr = round(vif_values[,"GVIF^(1/(2*Df))"], 3)
    )
  } else {
    vif_tab <- data.frame(
      Variable = names(vif_values),
      VIF = round(vif_values, 3)
    )
  }
  
  print(kable(vif_tab, caption = "Variance Inflation Factor (VIF)")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  #Diagnostic test table
  diag_tab <- data.frame(
    Test = c("Breusch-Pagan (Heteroscedasticity)",
             "Durbin-Watson (Autocorrelation)",
             "Shapiro-Wilk (Normality of Residuals)"),
    Statistic = c(round(bp_test$statistic, 3),
                  round(dw_test$statistic, 3),
                  round(shapiro_test$statistic, 3)),
    p_value = c(format.pval(bp_test$p.value, digits = 3, eps = 1e-3),
                format.pval(dw_test$p.value, digits = 3, eps = 1e-3),
                format.pval(shapiro_test$p.value, digits = 3, eps = 1e-3)),
    stringsAsFactors = FALSE
  )
  
  print(kable(diag_tab, caption = "Diagnostic Tests Summary")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  #Residuals plots
  par(mfrow = c(2,2))
  plot(mod)
  
  invisible(list(summary = summ,
                 vif = vif_values,
                 bptest = bp_test,
                 dwtest = dw_test,
                 shapiro = shapiro_test))
}

#Transform also other categorical variables in factor
data$Sesso <- factor(data$Sesso, levels = c("M", "F"))
data$Tipo.parto <- factor(data$Tipo.parto, levels = c("Nat", "Ces"))
data$Ospedale <- factor(data$Ospedale, levels = c("osp1", "osp2", "osp3"))

#Model1: all variables
mod1 <- lm(Peso ~., data = data)
mod_results(mod1)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) -6628.5902 143.0078 -46.3512 0.0000
Anni.madre 0.8018 1.1467 0.6992 0.4845
N.gravidanze 11.3812 4.6686 2.4378 0.0148
FumatriciFumatrici -30.2741 27.5492 -1.0989 0.2719
Gestazione 32.5773 3.8208 8.5264 0.0000
Lunghezza 10.2922 0.3009 34.2070 0.0000
Cranio 10.4722 0.4263 24.5671 0.0000
Tipo.partoCes -29.6335 12.0905 -2.4510 0.0143
Ospedaleosp2 -11.0912 13.4471 -0.8248 0.4096
Ospedaleosp3 28.2495 13.5054 2.0917 0.0366
SessoF -77.5723 11.1865 -6.9344 0.0000
Variance Inflation Factor (VIF)
Variable GVIF GVIF_corr
Anni.madre Anni.madre 1.190 1.091
N.gravidanze N.gravidanze 1.189 1.091
Fumatrici Fumatrici 1.007 1.004
Gestazione Gestazione 1.696 1.302
Lunghezza Lunghezza 2.087 1.445
Cranio Cranio 1.631 1.277
Tipo.parto Tipo.parto 1.004 1.002
Ospedale Ospedale 1.004 1.001
Sesso Sesso 1.041 1.020
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 93.294 <0.001
DW Durbin-Watson (Autocorrelation) 1.952 0.114
W Shapiro-Wilk (Normality of Residuals) 0.974 <0.001

In the initial full model, which includes all available variables, certain features such as gestational age, length, and head circumference are strongly associated with birth weight, highlighting their key role in neonatal growth.

The number of previous pregnancies shows a positive but moderate effect, suggesting that maternal reproductive history may influence the baby’s weight.

Male sex is associated with a higher birth weight, consistent with established biological knowledge.

Conversely, variables such as the hospital of birth and the type of delivery, although included in the initial model, show an overall weak and non-significant effect. This suggests that local differences between hospital groups or delivery methods may have a limited impact on the overall model.

Best model selection

In this phase, starting from the model containing all variables in the dataset, a backward selection approach is adopted to identify the most parsimonious and adequate multiple regression model. The process unfolds through the following steps:

#Function for ANOVA, AIC and BIC test for models
compare_models <- function(mod_prev, mod_new) {
  #Extract terms of two models
  terms_prev <- attr(terms(mod_prev), "term.labels")
  terms_new <- attr(terms(mod_new), "term.labels")
  
  #Verify if the new model is nested
  is_nested <- all(terms_new %in% terms_prev)
  
  #Calculate AIC and BIC
  aic_tab <- AIC(mod_prev, mod_new)
  bic_tab <- BIC(mod_prev, mod_new)
  
  print(knitr::kable(aic_tab, caption = "Method AIC for best model chose") %>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  print(knitr::kable(bic_tab, caption = "Method BIC for best model chose") %>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  if (is_nested) {
    cat("Models are nested. Performing ANOVA test.\n")
    anova_res <- anova(mod_prev, mod_new)
    print(knitr::kable(anova_res, caption = "Test ANOVA results") %>%
            kableExtra::kable_styling(full_width = FALSE, position = "left"))
  } else {
    cat("Models are NOT nested. ANOVA test skipped.\n")
  }
}

#Model2: all variables minus Anni.madre (not significant variable)
mod2 <- update(mod1, ~.-Anni.madre)
mod_results(mod2)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) -6601.3083 137.5680 -47.9858 0.0000
N.gravidanze 12.5833 4.3400 2.8994 0.0038
FumatriciFumatrici -30.4268 27.5455 -1.1046 0.2694
Gestazione 32.2996 3.7997 8.5006 0.0000
Lunghezza 10.2916 0.3008 34.2087 0.0000
Cranio 10.4874 0.4257 24.6375 0.0000
Tipo.partoCes -29.6654 12.0892 -2.4539 0.0142
Ospedaleosp2 -10.9509 13.4442 -0.8145 0.4154
Ospedaleosp3 28.5171 13.4986 2.1126 0.0347
SessoF -77.6452 11.1849 -6.9420 0.0000
Variance Inflation Factor (VIF)
Variable GVIF GVIF_corr
N.gravidanze N.gravidanze 1.028 1.014
Fumatrici Fumatrici 1.007 1.004
Gestazione Gestazione 1.677 1.295
Lunghezza Lunghezza 2.087 1.445
Cranio Cranio 1.627 1.275
Tipo.parto Tipo.parto 1.004 1.002
Ospedale Ospedale 1.003 1.001
Sesso Sesso 1.041 1.020
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 91.153 <0.001
DW Durbin-Watson (Autocorrelation) 1.953 0.12
W Shapiro-Wilk (Normality of Residuals) 0.974 <0.001

compare_models(mod1, mod2)
Method AIC for best model chose
df AIC
mod_prev 12 35145.57
mod_new 11 35144.06
Method BIC for best model chose
df BIC
mod_prev 12 35215.45
mod_new 11 35208.12
Models are nested. Performing ANOVA test.
Test ANOVA results
Res.Df RSS Df Sum of Sq F Pr(>F)
2487 186743194 NA NA NA NA
2488 186779904 -1 -36710 0.4888948 0.4844861
#Model3: all variables minus Ospedale (not significant variable)
mod3 <- update(mod2, ~.-Ospedale)
mod_results(mod3)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) -6600.2816 137.6162 -47.9615 0.0000
N.gravidanze 12.9927 4.3439 2.9910 0.0028
FumatriciFumatrici -31.8823 27.5803 -1.1560 0.2478
Gestazione 32.5970 3.8039 8.5693 0.0000
Lunghezza 10.2684 0.3011 34.0985 0.0000
Cranio 10.5015 0.4262 24.6372 0.0000
Tipo.partoCes -30.4244 12.1041 -2.5136 0.0120
SessoF -78.1031 11.1998 -6.9736 0.0000
Variance Inflation Factor (VIF)
Variable VIF
N.gravidanze N.gravidanze 1.027
Fumatrici Fumatrici 1.007
Gestazione Gestazione 1.676
Lunghezza Lunghezza 2.085
Cranio Cranio 1.626
Tipo.parto Tipo.parto 1.004
Sesso Sesso 1.040
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 90.431 <0.001
DW Durbin-Watson (Autocorrelation) 1.954 0.124
W Shapiro-Wilk (Normality of Residuals) 0.974 <0.001

compare_models(mod2, mod3)
Method AIC for best model chose
df AIC
mod_prev 11 35144.06
mod_new 9 35149.33
Method BIC for best model chose
df BIC
mod_prev 11 35208.12
mod_new 9 35201.73
Models are nested. Performing ANOVA test.
Test ANOVA results
Res.Df RSS Df Sum of Sq F Pr(>F)
2488 186779904 NA NA NA NA
2490 187473818 -2 -693913.8 4.621636 0.0099213
#Model4: log(Peso) for manage eteroschedasticity shown in residuals plot of mod3
mod4 <- lm(log(Peso) ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
               Cranio + Tipo.parto + Sesso, data = data)
mod_results(mod4)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 4.4198 0.0434 101.8054 0.0000
N.gravidanze 0.0041 0.0014 3.0097 0.0026
FumatriciFumatrici -0.0065 0.0087 -0.7443 0.4568
Gestazione 0.0186 0.0012 15.5179 0.0000
Lunghezza 0.0035 0.0001 37.2147 0.0000
Cranio 0.0035 0.0001 26.1255 0.0000
Tipo.partoCes -0.0070 0.0038 -1.8417 0.0656
SessoF -0.0181 0.0035 -5.1175 0.0000
Variance Inflation Factor (VIF)
Variable VIF
N.gravidanze N.gravidanze 1.027
Fumatrici Fumatrici 1.007
Gestazione Gestazione 1.676
Lunghezza Lunghezza 2.085
Cranio Cranio 1.626
Tipo.parto Tipo.parto 1.004
Sesso Sesso 1.040
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 169.739 <0.001
DW Durbin-Watson (Autocorrelation) 1.945 0.0848
W Shapiro-Wilk (Normality of Residuals) 0.974 <0.001

compare_models(mod3, mod4)
Method AIC for best model chose
df AIC
mod_prev 9 35149.325
mod_new 9 -5125.624
Method BIC for best model chose
df BIC
mod_prev 9 35201.734
mod_new 9 -5073.215
Models are nested. Performing ANOVA test.
Test ANOVA results
Df Sum Sq Mean Sq F value Pr(>F)
N.gravidanze 1 3.571804e+03 3.571804e+03 0.0474402 0.8275968
Fumatrici 1 2.516883e+05 2.516883e+05 3.3428881 0.0676156
Gestazione 1 2.450098e+08 2.450098e+08 3254.1845859 0.0000000
Lunghezza 1 2.048849e+08 2.048849e+08 2721.2517384 0.0000000
Cranio 1 4.707085e+07 4.707085e+07 625.1882273 0.0000000
Tipo.parto 1 4.810017e+05 4.810017e+05 6.3885943 0.0115470
Sesso 1 3.661486e+06 3.661486e+06 48.6313237 0.0000000
Residuals 2490 1.874738e+08 7.529069e+04 NA NA
#Model5: interaction beetwen Fumatrici and Lunghezza + Fumatrici and Gestazione
mod5 <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Lunghezza + Fumatrici * Gestazione + Cranio + Tipo.parto + Sesso, data = data)
mod_results(mod5)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 4.4134 0.0437 101.0835 0.0000
N.gravidanze 0.0043 0.0014 3.1212 0.0018
FumatriciFumatrici 0.3322 0.2556 1.2996 0.1939
Lunghezza 0.0035 0.0001 36.4149 0.0000
Gestazione 0.0193 0.0012 15.8743 0.0000
Cranio 0.0035 0.0001 26.1248 0.0000
Tipo.partoCes -0.0073 0.0038 -1.9213 0.0548
SessoF -0.0183 0.0035 -5.1684 0.0000
FumatriciFumatrici:Lunghezza 0.0011 0.0005 2.2917 0.0220
FumatriciFumatrici:Gestazione -0.0219 0.0070 -3.1418 0.0017
Variance Inflation Factor (VIF)
Variable VIF
N.gravidanze N.gravidanze 1.028
Fumatrici Fumatrici 872.074
Lunghezza Lunghezza 2.138
Gestazione Gestazione 1.730
Cranio Cranio 1.627
Tipo.parto Tipo.parto 1.004
Sesso Sesso 1.043
Fumatrici:Lunghezza Fumatrici:Lunghezza 695.009
Fumatrici:Gestazione Fumatrici:Gestazione 1004.897
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 176.993 <0.001
DW Durbin-Watson (Autocorrelation) 1.949 0.0997
W Shapiro-Wilk (Normality of Residuals) 0.975 <0.001

compare_models(mod4, mod5)
Method AIC for best model chose
df AIC
mod_prev 9 -5125.624
mod_new 11 -5132.234
Method BIC for best model chose
df BIC
mod_prev 9 -5073.215
mod_new 11 -5068.178

Models are NOT nested. ANOVA test skipped.

#Model5 centered: centering of Lunghezza and Gestazione to manage multicollinearity
data$Lunghezza_c <- scale(data$Lunghezza, scale=FALSE)
data$Gestazione_c <- scale(data$Gestazione, scale=FALSE)

mod5_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Gestazione_c + Fumatrici*Lunghezza_c + Cranio + Tipo.parto + Sesso, data=data)
mod_results(mod5_cent)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 6.8967 0.0457 150.8700 0.0000
N.gravidanze 0.0043 0.0014 3.1212 0.0018
FumatriciFumatrici 0.0023 0.0091 0.2512 0.8017
Gestazione_c 0.0193 0.0012 15.8743 0.0000
Lunghezza_c 0.0035 0.0001 36.4149 0.0000
Cranio 0.0035 0.0001 26.1248 0.0000
Tipo.partoCes -0.0073 0.0038 -1.9213 0.0548
SessoF -0.0183 0.0035 -5.1684 0.0000
FumatriciFumatrici:Gestazione_c -0.0219 0.0070 -3.1418 0.0017
FumatriciFumatrici:Lunghezza_c 0.0011 0.0005 2.2917 0.0220
Variance Inflation Factor (VIF)
Variable VIF
N.gravidanze N.gravidanze 1.028
Fumatrici Fumatrici 1.105
Gestazione_c Gestazione_c 1.730
Lunghezza_c Lunghezza_c 2.138
Cranio Cranio 1.627
Tipo.parto Tipo.parto 1.004
Sesso Sesso 1.043
Fumatrici:Gestazione_c Fumatrici:Gestazione_c 1.416
Fumatrici:Lunghezza_c Fumatrici:Lunghezza_c 1.387
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 176.993 <0.001
DW Durbin-Watson (Autocorrelation) 1.949 0.0997
W Shapiro-Wilk (Normality of Residuals) 0.975 <0.001

compare_models(mod5, mod5_cent)
Method AIC for best model chose
df AIC
mod_prev 11 -5132.234
mod_new 11 -5132.234
Method BIC for best model chose
df BIC
mod_prev 11 -5068.178
mod_new 11 -5068.178

Models are NOT nested. ANOVA test skipped.

#Model6: interaction beetwen Gestazione and Lunghezza
mod6_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Gestazione_c + Fumatrici*Lunghezza_c + Cranio + Sesso + Gestazione_c*Lunghezza_c, data=data)
mod_results(mod6_cent)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 6.9523 0.0448 155.3155 0.0000
N.gravidanze 0.0038 0.0013 2.8658 0.0042
FumatriciFumatrici -0.0002 0.0089 -0.0178 0.9858
Gestazione_c 0.0132 0.0013 10.1521 0.0000
Lunghezza_c 0.0033 0.0001 34.2855 0.0000
Cranio 0.0034 0.0001 25.5879 0.0000
SessoF -0.0222 0.0035 -6.4110 0.0000
FumatriciFumatrici:Gestazione_c -0.0186 0.0068 -2.7279 0.0064
FumatriciFumatrici:Lunghezza_c 0.0012 0.0005 2.6452 0.0082
Gestazione_c:Lunghezza_c -0.0002 0.0000 -11.7191 0.0000
Variance Inflation Factor (VIF)
Variable VIF
N.gravidanze N.gravidanze 1.028
Fumatrici Fumatrici 1.106
Gestazione_c Gestazione_c 2.070
Lunghezza_c Lunghezza_c 2.220
Cranio Cranio 1.641
Sesso Sesso 1.053
Fumatrici:Gestazione_c Fumatrici:Gestazione_c 1.417
Fumatrici:Lunghezza_c Fumatrici:Lunghezza_c 1.388
Gestazione_c:Lunghezza_c Gestazione_c:Lunghezza_c 1.734
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 147.794 <0.001
DW Durbin-Watson (Autocorrelation) 1.946 0.0879
W Shapiro-Wilk (Normality of Residuals) 0.975 <0.001

compare_models(mod5_cent, mod6_cent)
Method AIC for best model chose
df AIC
mod_prev 11 -5132.234
mod_new 11 -5262.748
Method BIC for best model chose
df BIC
mod_prev 11 -5068.178
mod_new 11 -5198.693

Models are NOT nested. ANOVA test skipped.

#Model7: add of non linear term of Gestazione and eliminate interaction beetwen Gestazione and Lunghezza (for reduce multicollinearity)
data$Gestazione_c2 <- data$Gestazione_c^2
data$Lunghezza_c2 <- data$Lunghezza_c^2

mod7_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici*Gestazione_c + Gestazione_c2 +
              Fumatrici*Lunghezza_c + Lunghezza_c + 
              Cranio + Sesso, data=data)
mod_results(mod7_cent)
Coefficients of the Regression Model
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 6.9544 0.0447 155.4749 0.0000
N.gravidanze 0.0041 0.0013 3.0477 0.0023
FumatriciFumatrici -0.0011 0.0089 -0.1238 0.9015
Gestazione_c 0.0117 0.0013 8.7372 0.0000
Gestazione_c2 -0.0024 0.0002 -11.9303 0.0000
Lunghezza_c 0.0033 0.0001 35.1399 0.0000
Cranio 0.0034 0.0001 25.6199 0.0000
SessoF -0.0216 0.0034 -6.2728 0.0000
FumatriciFumatrici:Gestazione_c -0.0152 0.0068 -2.2305 0.0258
FumatriciFumatrici:Lunghezza_c 0.0011 0.0005 2.4925 0.0127
Variance Inflation Factor (VIF)
Variable VIF
N.gravidanze N.gravidanze 1.028
Fumatrici Fumatrici 1.106
Gestazione_c Gestazione_c 2.228
Gestazione_c2 Gestazione_c2 1.837
Lunghezza_c Lunghezza_c 2.183
Cranio Cranio 1.641
Sesso Sesso 1.051
Fumatrici:Gestazione_c Fumatrici:Gestazione_c 1.424
Fumatrici:Lunghezza_c Fumatrici:Lunghezza_c 1.387
Diagnostic Tests Summary
Test Statistic p_value
BP Breusch-Pagan (Heteroscedasticity) 158.233 <0.001
DW Durbin-Watson (Autocorrelation) 1.951 0.111
W Shapiro-Wilk (Normality of Residuals) 0.978 <0.001

compare_models(mod6_cent, mod7_cent)
Method AIC for best model chose
df AIC
mod_prev 11 -5262.748
mod_new 11 -5267.497
Method BIC for best model chose
df BIC
mod_prev 11 -5198.693
mod_new 11 -5203.442

Models are NOT nested. ANOVA test skipped.

#Resume of best model selection with AIC and BIC
knitr::kable(AIC(mod1, mod2, mod3, mod4, mod5, mod5_cent, mod6_cent, mod7_cent), caption = "Resume of results for method AIC for best model")
Resume of results for method AIC for best model
df AIC
mod1 12 35145.571
mod2 11 35144.062
mod3 9 35149.325
mod4 9 -5125.624
mod5 11 -5132.234
mod5_cent 11 -5132.234
mod6_cent 11 -5262.748
mod7_cent 11 -5267.497
knitr::kable(BIC(mod1, mod2, mod3, mod4, mod5, mod5_cent, mod6_cent, mod7_cent), caption = "Resume of results for method BIC for best model")
Resume of results for method BIC for best model
df BIC
mod1 12 35215.450
mod2 11 35208.118
mod3 9 35201.734
mod4 9 -5073.215
mod5 11 -5068.178
mod5_cent 11 -5068.178
mod6_cent 11 -5198.693
mod7_cent 11 -5203.442

The final model demonstrates a good balance between simplicity and predictive performance, as highlighted by the AIC, BIC criteria, and coefficient tests. Diagnostic analysis verifies the fundamental assumptions of linear regression and indicates:

The most relevant variables for predicting neonatal weight include gestational duration, which exhibits a non-linear effect with a more marked increase in the early stages followed by stabilization, head circumference, male sex, and the number of previous pregnancies. Additionally, maternal smoking interacts complexly with gestational age and newborn length, highlighting the importance of considering such interactions in the model.

This summary provides a robust and interpretable model that faithfully describes the data dynamics and lays solid foundations for further considerations.

Model quality analysis

To ensure the model’s robustness and reliability, an analysis is conducted to identify outliers, high-leverage points, and influential observations that could disproportionately impact the estimates. This step is particularly important in light of the heteroscedasticity revealed by the preliminary diagnostics.

The investigation relies on established metrics such as leverage, studentized residuals, and Cook’s distance, complemented by graphical visualizations and tables of the most relevant observations. This approach allows distinguishing between anomalies or errors and biologically plausible but rare values, thus supporting informed decisions on potential model refinements or further analyses.

#Function to diagnosis of outlier and leverage points
diagnostic_plots_ggplot <- function(mod) {
  
  n <- length(residuals(mod))
  p <- length(coefficients(mod))
  
  data_diag <- data.frame(
    Index = 1:n,
    Leverage = hatvalues(mod),
    RStudent = rstudent(mod),
    CooksDistance = cooks.distance(mod)
  )
  
  #Cutoff for influential points
  leverage_cutoff <- 2 * p / n
  rstudent_cutoff <- 2
  cook_cutoff <- 4 / (n - p)
  
  #Flag influential points
  data_diag <- data_diag %>%
    mutate(
      HighLeverage = Leverage > leverage_cutoff,
      Outlier = abs(RStudent) > rstudent_cutoff,
      Influential = CooksDistance > cook_cutoff
    )
  
  #Function to general plot
  plot_diag <- function(df, yvar, ylab, cutoff, flagvar) {
    ggplot(df, aes(x = Index, y = !!sym(yvar))) +
      geom_point(aes(color = !!sym(flagvar))) +
      geom_line() +
      geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
      scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
      labs(title = paste(ylab, "with Influential Points Highlighted"),
           y = ylab, x = "Index") +
      theme_classic()
  }
  
  #Plots
  p1 <- plot_diag(data_diag, "Leverage", "Leverage (hat values)", leverage_cutoff, "HighLeverage")
  p2 <- plot_diag(data_diag, "RStudent", "Residuals student", rstudent_cutoff, "Outlier")
  p3 <- plot_diag(data_diag, "CooksDistance", "Cook's distance", cook_cutoff, "Influential")
  
  #Influential points table
  influential_points <- data_diag %>%
    filter(HighLeverage | Outlier | Influential) %>%
    arrange(desc(CooksDistance)) %>%
    dplyr::select(Index, Leverage, RStudent, CooksDistance, HighLeverage, Outlier, Influential)
  
  #table_kable <- knitr::kable(influential_points, caption = "Table of influential points")
  
  #Return plots
  return(list(LeveragePlot = p1, ResidualsPlot = p2, CookPlot = p3, InfluentialPoints =   influential_points))
}

res <- diagnostic_plots_ggplot(mod7_cent)

#Plots of outlier, leverage and Cook's distance
print(res$LeveragePlot)

print(res$ResidualsPlot)

print(res$CookPlot)

#Extract of influential data
influential_indices <- res$InfluentialPoints$Index
influential_data <- data[influential_indices, 1:10]
influential_data$Index <- influential_indices

#Influential points of dataset
diagnostics_info <- res$InfluentialPoints %>%
  dplyr::select(Index, HighLeverage, Outlier, Influential) 

#Built of join
final_influential <- inner_join(influential_data, diagnostics_info, by = "Index")

kable(final_influential, caption = "Data and diagnosis of influential points") %>%
  kable_styling(full_width = FALSE, position = "left")
Data and diagnosis of influential points
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso Index HighLeverage Outlier Influential
35 1 NonFumatrici 38 4370 315 374 Nat osp3 F 1549 TRUE TRUE TRUE
25 2 NonFumatrici 25 900 325 253 Nat osp3 F 1778 TRUE TRUE TRUE
24 4 NonFumatrici 29 1280 390 355 Nat osp1 F 1427 TRUE TRUE TRUE
26 0 Fumatrici 39 4930 550 350 Ces osp2 F 1918 TRUE FALSE TRUE
40 3 NonFumatrici 28 1560 420 379 Nat osp3 F 310 TRUE TRUE TRUE
28 0 NonFumatrici 26 930 345 245 Ces osp3 F 2450 TRUE FALSE TRUE
30 0 NonFumatrici 36 3610 410 330 Nat osp1 M 155 TRUE TRUE TRUE
30 1 NonFumatrici 36 1280 385 292 Nat osp2 F 1426 TRUE TRUE TRUE
32 0 NonFumatrici 27 1140 370 267 Nat osp3 F 2118 TRUE FALSE TRUE
35 1 NonFumatrici 32 1890 500 309 Nat osp2 F 2113 TRUE TRUE TRUE
28 1 NonFumatrici 27 980 320 265 Nat osp1 M 2435 TRUE FALSE TRUE
32 4 Fumatrici 39 2250 460 319 Nat osp2 F 1424 TRUE FALSE TRUE
31 0 NonFumatrici 31 990 340 278 Ces osp2 F 1617 TRUE TRUE TRUE
34 1 Fumatrici 40 3180 470 326 Nat osp1 F 1396 TRUE FALSE TRUE
41 3 NonFumatrici 35 1500 420 304 Nat osp1 M 1591 FALSE TRUE TRUE
27 1 NonFumatrici 38 3240 410 359 Ces osp1 F 2038 TRUE TRUE TRUE
31 3 Fumatrici 37 3440 460 362 Ces osp1 M 2335 TRUE FALSE TRUE
27 0 NonFumatrici 35 3140 465 290 Nat osp2 F 2223 FALSE TRUE TRUE
28 3 Fumatrici 41 2940 465 328 Nat osp2 F 2351 TRUE FALSE TRUE
30 4 NonFumatrici 35 4520 520 360 Nat osp2 F 1551 TRUE TRUE TRUE
26 0 Fumatrici 40 3400 525 350 Nat osp2 M 391 TRUE FALSE TRUE
22 1 Fumatrici 40 3350 520 355 Nat osp3 M 99 TRUE FALSE TRUE
25 1 Fumatrici 38 3280 475 324 Nat osp1 F 567 TRUE FALSE TRUE
29 1 Fumatrici 41 4420 530 362 Ces osp2 F 140 TRUE FALSE TRUE
33 1 Fumatrici 38 3030 500 353 Nat osp3 F 1624 TRUE FALSE TRUE
18 0 NonFumatrici 40 1850 460 305 Nat osp3 F 295 FALSE TRUE TRUE
44 0 Fumatrici 38 3150 465 335 Nat osp3 F 335 TRUE FALSE TRUE
33 0 NonFumatrici 29 1620 410 292 Nat osp3 F 1383 TRUE FALSE TRUE
20 0 Fumatrici 42 3870 530 381 Ces osp1 F 2014 TRUE FALSE TRUE
35 6 Fumatrici 38 2430 460 324 Nat osp2 F 442 TRUE FALSE TRUE
30 7 NonFumatrici 35 2220 470 316 Nat osp3 M 582 TRUE TRUE TRUE
42 2 NonFumatrici 38 2560 525 349 Ces osp2 M 1397 FALSE TRUE TRUE
30 3 NonFumatrici 38 4600 485 380 Nat osp1 M 1292 FALSE TRUE TRUE
24 0 NonFumatrici 35 1450 405 280 Nat osp1 F 750 FALSE TRUE TRUE
31 0 NonFumatrici 34 1370 390 287 Nat osp2 F 101 FALSE TRUE TRUE
33 10 NonFumatrici 40 3090 485 353 Nat osp3 M 2420 TRUE FALSE TRUE
36 8 NonFumatrici 41 3730 480 335 Nat osp3 M 1448 TRUE FALSE TRUE
28 0 NonFumatrici 39 3800 520 300 Nat osp3 F 1710 FALSE TRUE TRUE
28 0 Fumatrici 41 3820 500 350 Nat osp2 M 424 TRUE FALSE TRUE
35 2 Fumatrici 39 4030 510 344 Ces osp2 M 2410 TRUE FALSE TRUE
23 1 NonFumatrici 36 3850 460 334 Ces osp3 F 1692 FALSE TRUE TRUE
32 2 NonFumatrici 39 3430 445 322 Ces osp1 F 1633 FALSE TRUE TRUE
31 2 Fumatrici 38 3350 510 348 Nat osp1 M 2280 TRUE FALSE TRUE
35 10 NonFumatrici 39 2950 495 335 Nat osp1 F 2219 TRUE FALSE TRUE
30 1 NonFumatrici 39 3000 475 390 Ces osp2 F 684 TRUE TRUE TRUE
29 0 NonFumatrici 40 3470 525 390 Nat osp1 M 1866 FALSE TRUE TRUE
25 1 NonFumatrici 37 1750 430 305 Nat osp3 F 322 FALSE TRUE TRUE
30 2 NonFumatrici 39 3790 505 304 Ces osp3 M 1393 FALSE TRUE TRUE
22 0 NonFumatrici 32 2340 445 304 Nat osp1 F 1356 TRUE TRUE TRUE
27 4 Fumatrici 41 3350 485 335 Ces osp3 M 1261 TRUE FALSE TRUE
32 4 Fumatrici 39 3400 510 356 Nat osp1 M 1284 TRUE FALSE TRUE
21 1 Fumatrici 40 2590 480 328 Ces osp1 F 413 TRUE FALSE TRUE
33 0 NonFumatrici 36 3110 465 300 Ces osp3 M 1854 FALSE TRUE TRUE
31 0 NonFumatrici 40 3410 550 372 Nat osp2 M 119 FALSE TRUE TRUE
33 0 NonFumatrici 30 1750 410 294 Nat osp2 M 2198 TRUE FALSE TRUE
24 0 NonFumatrici 42 2800 520 340 Ces osp2 M 2313 FALSE TRUE TRUE
34 4 NonFumatrici 42 2660 500 320 Nat osp2 F 1716 TRUE TRUE TRUE
19 1 NonFumatrici 38 1950 445 305 Nat osp2 M 1741 FALSE TRUE TRUE
35 1 Fumatrici 41 2910 490 340 Nat osp3 M 2108 TRUE FALSE TRUE
29 3 Fumatrici 41 3090 495 350 Nat osp3 M 473 TRUE FALSE TRUE
22 1 NonFumatrici 34 3030 470 312 Nat osp2 F 1891 FALSE TRUE TRUE
28 1 NonFumatrici 40 3790 460 332 Nat osp2 F 1267 FALSE TRUE TRUE
40 8 NonFumatrici 38 3520 470 341 Nat osp3 M 516 TRUE FALSE TRUE
28 0 Fumatrici 40 2740 500 325 Nat osp3 F 2001 TRUE FALSE TRUE
37 2 Fumatrici 41 3220 500 357 Ces osp3 M 1689 TRUE FALSE TRUE
27 2 Fumatrici 39 2920 490 351 Ces osp3 M 1523 TRUE FALSE TRUE
27 0 Fumatrici 42 3970 515 360 Nat osp2 M 1658 TRUE FALSE TRUE
23 0 NonFumatrici 41 4900 510 352 Nat osp2 F 1305 FALSE TRUE TRUE
31 4 Fumatrici 38 3530 495 335 Nat osp1 F 1969 TRUE FALSE TRUE
30 8 NonFumatrici 39 2860 490 337 Ces osp2 F 1503 TRUE FALSE TRUE
26 3 NonFumatrici 31 1960 420 300 Nat osp2 F 1067 TRUE FALSE TRUE
27 0 Fumatrici 41 3160 490 323 Nat osp1 F 1026 TRUE FALSE TRUE
23 1 NonFumatrici 40 3520 445 363 Nat osp1 F 220 FALSE FALSE TRUE
26 1 NonFumatrici 30 1170 370 273 Nat osp3 M 2305 TRUE FALSE TRUE
26 8 NonFumatrici 40 3250 500 355 Nat osp2 M 2084 TRUE FALSE TRUE
29 1 NonFumatrici 41 4010 470 353 Nat osp2 M 1229 FALSE TRUE TRUE
38 4 NonFumatrici 38 4370 530 340 Nat osp3 F 1960 FALSE TRUE TRUE
34 10 NonFumatrici 38 2880 470 345 Ces osp2 M 2469 TRUE FALSE TRUE
38 0 NonFumatrici 40 3700 470 320 Nat osp1 M 390 FALSE TRUE TRUE
38 3 NonFumatrici 41 2320 450 330 Nat osp2 F 656 FALSE TRUE TRUE
29 1 Fumatrici 40 4180 520 355 Ces osp1 M 2047 TRUE FALSE TRUE
34 2 NonFumatrici 33 1410 380 295 Nat osp2 F 492 TRUE FALSE TRUE
41 1 NonFumatrici 39 2160 450 325 Ces osp3 M 1193 FALSE TRUE TRUE
37 0 NonFumatrici 41 2420 490 300 Ces osp1 M 1554 FALSE TRUE TRUE
22 0 NonFumatrici 32 1430 380 301 Nat osp1 M 1699 TRUE FALSE TRUE
40 6 NonFumatrici 34 1840 430 305 Nat osp1 F 1310 TRUE FALSE TRUE
22 0 Fumatrici 39 3290 510 346 Nat osp3 M 1761 TRUE FALSE TRUE
32 1 NonFumatrici 33 2040 480 307 Ces osp1 F 1272 FALSE FALSE TRUE
26 1 Fumatrici 38 3060 490 350 Nat osp3 M 2283 TRUE FALSE TRUE
30 2 Fumatrici 40 3130 480 320 Nat osp1 M 1759 TRUE FALSE TRUE
38 6 NonFumatrici 37 3950 500 350 Nat osp3 M 134 FALSE FALSE TRUE
27 1 NonFumatrici 35 2100 440 345 Ces osp2 F 615 FALSE TRUE TRUE
29 0 NonFumatrici 42 3540 540 368 Nat osp1 M 616 FALSE TRUE TRUE
25 6 NonFumatrici 38 2530 460 340 Nat osp1 F 2315 TRUE FALSE TRUE
33 2 Fumatrici 40 3320 490 320 Nat osp3 M 1650 TRUE FALSE TRUE
32 3 NonFumatrici 36 3780 480 349 Nat osp3 M 1340 FALSE TRUE TRUE
30 0 NonFumatrici 38 4540 530 343 Ces osp3 M 1539 FALSE TRUE TRUE
22 0 NonFumatrici 42 3100 510 361 Nat osp2 F 1586 FALSE TRUE TRUE
31 0 NonFumatrici 37 2040 465 305 Nat osp2 F 1508 FALSE TRUE TRUE
35 0 NonFumatrici 33 1390 390 277 Nat osp1 F 748 TRUE FALSE TRUE
23 1 Fumatrici 41 3590 500 349 Nat osp3 F 105 TRUE FALSE TRUE
38 3 Fumatrici 40 3300 510 345 Nat osp3 M 234 TRUE FALSE TRUE
40 0 NonFumatrici 34 1580 400 300 Nat osp2 F 2255 FALSE FALSE TRUE
27 1 NonFumatrici 38 2480 500 315 Ces osp2 M 1913 FALSE TRUE TRUE
37 3 Fumatrici 39 3150 470 341 Nat osp1 F 1395 TRUE FALSE TRUE
32 5 NonFumatrici 36 2400 470 333 Nat osp1 F 1008 FALSE FALSE TRUE
35 1 NonFumatrici 39 2660 460 364 Ces osp1 F 1400 FALSE FALSE TRUE
37 1 NonFumatrici 39 2500 490 352 Ces osp1 M 2217 FALSE TRUE TRUE
29 4 NonFumatrici 30 1340 400 273 Ces osp1 M 106 TRUE FALSE TRUE
26 1 NonFumatrici 30 1170 370 266 Nat osp2 M 1247 TRUE FALSE TRUE
30 1 NonFumatrici 41 4440 510 335 Nat osp3 M 791 FALSE TRUE TRUE
32 6 NonFumatrici 40 4200 510 350 Nat osp1 M 828 FALSE FALSE TRUE
35 9 NonFumatrici 42 3760 540 348 Nat osp2 F 161 TRUE FALSE TRUE
33 3 NonFumatrici 40 3640 470 337 Nat osp3 F 2285 FALSE TRUE TRUE
39 2 NonFumatrici 35 1980 450 308 Ces osp1 M 724 FALSE TRUE TRUE
20 0 Fumatrici 38 2990 470 330 Nat osp2 F 557 TRUE FALSE TRUE
30 0 NonFumatrici 40 2380 480 305 Nat osp2 M 1217 FALSE FALSE TRUE
27 1 NonFumatrici 38 2150 450 315 Ces osp3 M 458 FALSE TRUE TRUE
26 0 NonFumatrici 34 2050 460 325 Nat osp1 F 1211 FALSE FALSE TRUE
37 4 NonFumatrici 37 3720 480 355 Ces osp1 F 1470 FALSE FALSE TRUE
39 1 NonFumatrici 31 1500 405 295 Nat osp3 M 206 TRUE FALSE FALSE
25 0 NonFumatrici 39 3290 450 338 Nat osp3 F 2056 FALSE TRUE FALSE
30 2 Fumatrici 37 3280 500 344 Ces osp1 M 128 TRUE FALSE FALSE
40 2 Fumatrici 38 3440 495 337 Nat osp3 F 1900 TRUE FALSE FALSE
27 2 NonFumatrici 39 2260 460 330 Nat osp2 F 1486 FALSE TRUE FALSE
37 2 NonFumatrici 36 3710 480 350 Nat osp3 M 2130 FALSE TRUE FALSE
18 0 NonFumatrici 36 3050 460 315 Nat osp2 F 1556 FALSE TRUE FALSE
23 2 Fumatrici 40 2950 495 340 Ces osp2 F 1037 TRUE FALSE FALSE
22 1 Fumatrici 40 3670 500 340 Nat osp1 M 120 TRUE FALSE FALSE
22 0 NonFumatrici 32 2580 470 330 Nat osp1 F 2214 TRUE FALSE FALSE
33 2 NonFumatrici 39 3660 490 318 Nat osp2 F 890 FALSE TRUE FALSE
38 1 NonFumatrici 40 3980 480 335 Nat osp1 F 2193 FALSE TRUE FALSE
19 0 NonFumatrici 36 2100 445 332 Ces osp1 F 1754 FALSE TRUE FALSE
22 0 Fumatrici 40 4220 510 375 Nat osp1 M 1581 TRUE FALSE FALSE
23 3 Fumatrici 41 2620 475 313 Nat osp3 M 2218 TRUE FALSE FALSE
30 0 Fumatrici 38 3380 490 340 Nat osp3 F 1169 TRUE FALSE FALSE
26 1 NonFumatrici 32 1280 360 276 Nat osp2 M 312 TRUE FALSE FALSE
31 1 NonFumatrici 39 3400 460 325 Nat osp1 M 377 FALSE TRUE FALSE
30 2 NonFumatrici 39 4240 485 352 Nat osp2 M 130 FALSE TRUE FALSE
18 1 Fumatrici 40 3600 500 335 Nat osp3 M 296 TRUE FALSE FALSE
35 2 NonFumatrici 39 3300 460 320 Ces osp1 M 908 FALSE TRUE FALSE
33 3 NonFumatrici 39 3950 480 345 Nat osp3 M 2133 FALSE TRUE FALSE
24 1 NonFumatrici 40 3820 500 320 Ces osp2 F 146 FALSE TRUE FALSE
36 1 NonFumatrici 41 2940 500 358 Nat osp3 M 1935 FALSE TRUE FALSE
30 1 NonFumatrici 39 2380 495 325 Nat osp1 F 1497 FALSE TRUE FALSE
27 2 Fumatrici 37 2850 460 335 Nat osp1 F 251 TRUE FALSE FALSE
33 0 NonFumatrici 37 2230 455 326 Nat osp1 M 1748 FALSE TRUE FALSE
26 0 Fumatrici 40 3570 500 335 Nat osp2 M 1456 TRUE FALSE FALSE
21 1 Fumatrici 41 3270 510 340 Nat osp2 M 1904 TRUE FALSE FALSE
21 0 NonFumatrici 37 2750 510 333 Nat osp1 M 632 FALSE TRUE FALSE
30 0 Fumatrici 40 3120 500 344 Nat osp3 M 984 TRUE FALSE FALSE
27 0 NonFumatrici 28 1285 400 274 Nat osp1 F 378 TRUE FALSE FALSE
22 2 NonFumatrici 38 2410 470 336 Nat osp2 M 364 FALSE TRUE FALSE
24 2 NonFumatrici 40 2560 490 323 Nat osp1 M 1191 FALSE TRUE FALSE
28 1 Fumatrici 40 2850 490 334 Nat osp1 F 2143 TRUE FALSE FALSE
35 3 NonFumatrici 40 2660 485 335 Ces osp3 M 1286 FALSE TRUE FALSE
35 1 NonFumatrici 38 2850 500 360 Nat osp1 F 648 FALSE TRUE FALSE
25 2 NonFumatrici 41 3990 495 335 Nat osp3 M 472 FALSE TRUE FALSE
32 2 Fumatrici 38 2920 480 348 Nat osp2 F 1051 TRUE FALSE FALSE
28 1 NonFumatrici 40 2410 470 319 Nat osp1 M 417 FALSE TRUE FALSE
34 0 NonFumatrici 39 3180 470 310 Nat osp3 F 195 FALSE TRUE FALSE
26 2 Fumatrici 39 3530 505 332 Ces osp2 F 194 TRUE FALSE FALSE
26 0 NonFumatrici 40 4330 500 355 Nat osp3 F 1036 FALSE TRUE FALSE
24 0 NonFumatrici 38 2400 460 335 Nat osp3 M 1877 FALSE TRUE FALSE
21 0 NonFumatrici 41 2700 500 336 Nat osp2 F 653 FALSE TRUE FALSE
37 8 NonFumatrici 28 930 355 235 Nat osp1 F 2173 TRUE FALSE FALSE
30 0 NonFumatrici 40 3860 480 338 Nat osp1 M 633 FALSE TRUE FALSE
24 1 Fumatrici 39 3400 485 344 Nat osp1 F 242 TRUE FALSE FALSE
25 0 NonFumatrici 41 2980 500 356 Nat osp3 M 950 FALSE TRUE FALSE
37 3 NonFumatrici 33 2000 470 293 Ces osp1 F 1608 TRUE FALSE FALSE
21 0 NonFumatrici 39 3580 485 320 Nat osp3 M 318 FALSE TRUE FALSE
27 1 NonFumatrici 39 4650 510 354 Nat osp2 M 2021 FALSE TRUE FALSE
34 3 NonFumatrici 40 4580 515 360 Nat osp2 M 1836 FALSE TRUE FALSE
21 0 NonFumatrici 39 3950 500 330 Nat osp1 F 1132 FALSE TRUE FALSE
26 0 NonFumatrici 40 3580 480 326 Nat osp1 F 418 FALSE TRUE FALSE
17 0 NonFumatrici 37 2050 390 295 Nat osp2 F 1014 TRUE FALSE FALSE
35 1 Fumatrici 39 2440 450 326 Nat osp1 F 1332 TRUE FALSE FALSE
20 0 NonFumatrici 38 3700 480 335 Nat osp3 F 5 FALSE TRUE FALSE
20 1 NonFumatrici 40 2510 480 322 Ces osp2 M 1419 FALSE TRUE FALSE
30 1 NonFumatrici 32 2260 440 322 Ces osp3 F 592 TRUE FALSE FALSE
30 6 NonFumatrici 35 2680 450 322 Nat osp1 F 757 TRUE FALSE FALSE
28 1 NonFumatrici 39 2600 480 350 Nat osp1 F 262 FALSE TRUE FALSE
25 1 NonFumatrici 41 2950 515 333 Ces osp2 M 460 FALSE TRUE FALSE
24 0 NonFumatrici 38 3920 505 333 Nat osp3 M 2010 FALSE TRUE FALSE
25 1 Fumatrici 42 2980 490 335 Nat osp2 M 758 TRUE FALSE FALSE
26 0 NonFumatrici 40 2660 500 343 Nat osp1 F 2396 FALSE TRUE FALSE
34 1 Fumatrici 40 3310 490 330 Nat osp3 M 1118 TRUE FALSE FALSE
28 0 Fumatrici 39 3310 500 355 Ces osp1 M 1718 TRUE FALSE FALSE
35 3 Fumatrici 38 2750 475 332 Nat osp1 F 1670 TRUE FALSE FALSE
34 0 NonFumatrici 41 3940 500 340 Nat osp1 F 623 FALSE TRUE FALSE
24 0 NonFumatrici 38 2920 505 351 Ces osp3 M 455 FALSE TRUE FALSE
34 1 NonFumatrici 40 2580 495 325 Nat osp3 F 1210 FALSE TRUE FALSE
25 6 NonFumatrici 33 2230 430 313 Nat osp3 F 2357 TRUE FALSE FALSE
30 8 NonFumatrici 40 3850 518 340 Nat osp3 F 204 TRUE FALSE FALSE
39 1 Fumatrici 39 2770 470 342 Nat osp3 F 1574 TRUE FALSE FALSE
27 0 NonFumatrici 32 1550 410 289 Nat osp1 F 445 TRUE FALSE FALSE
28 0 NonFumatrici 38 3720 500 330 Nat osp3 F 1517 FALSE TRUE FALSE
29 0 Fumatrici 37 2440 460 315 Nat osp1 F 658 TRUE FALSE FALSE
27 2 Fumatrici 39 2880 490 325 Ces osp1 M 2359 TRUE FALSE FALSE
26 0 NonFumatrici 40 3000 520 340 Ces osp3 M 1137 FALSE TRUE FALSE
36 0 NonFumatrici 31 1180 355 270 Nat osp3 F 2112 TRUE FALSE FALSE
32 1 Fumatrici 33 1780 400 305 Ces osp1 F 2087 TRUE FALSE FALSE
30 1 Fumatrici 37 4050 530 350 Nat osp1 M 2243 TRUE FALSE FALSE
35 5 Fumatrici 40 4300 530 360 Nat osp3 M 2044 TRUE FALSE FALSE
25 0 NonFumatrici 40 2990 515 350 Ces osp1 F 2183 FALSE TRUE FALSE
19 0 Fumatrici 40 3380 500 361 Nat osp2 M 1441 TRUE FALSE FALSE
35 0 NonFumatrici 32 1780 420 277 Ces osp1 F 1807 TRUE FALSE FALSE
27 1 NonFumatrici 40 3890 495 330 Nat osp2 M 2368 FALSE TRUE FALSE
28 1 NonFumatrici 39 3920 485 345 Nat osp2 F 850 FALSE TRUE FALSE
25 0 NonFumatrici 28 830 310 254 Nat osp1 F 928 TRUE FALSE FALSE
27 0 Fumatrici 39 2990 485 350 Nat osp1 F 1931 TRUE FALSE FALSE
31 1 NonFumatrici 38 3920 490 345 Nat osp2 M 1942 FALSE TRUE FALSE
25 0 NonFumatrici 41 2210 430 310 Nat osp3 F 956 TRUE FALSE FALSE
34 0 NonFumatrici 39 2690 495 335 Nat osp1 M 2149 FALSE TRUE FALSE
31 1 NonFumatrici 38 2670 500 336 Ces osp2 F 980 FALSE TRUE FALSE
25 0 Fumatrici 40 3860 520 345 Nat osp1 M 703 TRUE FALSE FALSE
39 3 NonFumatrici 30 1300 380 276 Nat osp1 M 2147 TRUE FALSE FALSE
21 0 NonFumatrici 40 4050 500 340 Nat osp3 M 1583 FALSE TRUE FALSE
31 1 Fumatrici 40 3310 490 334 Ces osp2 M 2457 TRUE FALSE FALSE
29 1 NonFumatrici 39 3800 490 333 Ces osp3 M 90 FALSE TRUE FALSE
30 2 NonFumatrici 29 1190 360 272 Nat osp2 F 805 TRUE FALSE FALSE
30 1 Fumatrici 39 3100 485 325 Nat osp3 F 2200 TRUE FALSE FALSE
26 1 NonFumatrici 39 2850 500 347 Ces osp1 M 2422 FALSE TRUE FALSE
25 1 NonFumatrici 39 2720 495 345 Nat osp3 F 361 FALSE TRUE FALSE
23 1 NonFumatrici 39 4100 505 350 Nat osp1 F 1206 FALSE TRUE FALSE
34 1 Fumatrici 39 2910 490 325 Nat osp2 M 932 TRUE FALSE FALSE
31 1 Fumatrici 40 2850 485 330 Nat osp1 F 1110 TRUE FALSE FALSE
38 12 NonFumatrici 39 3350 490 344 Nat osp2 M 1218 TRUE FALSE FALSE
24 3 Fumatrici 40 3020 490 338 Nat osp2 F 1325 TRUE FALSE FALSE
36 8 NonFumatrici 36 2860 460 334 Nat osp2 F 1725 TRUE FALSE FALSE
37 2 NonFumatrici 31 1690 405 290 Nat osp2 M 2406 TRUE FALSE FALSE
19 0 Fumatrici 39 2710 470 331 Nat osp3 F 321 TRUE FALSE FALSE
16 1 NonFumatrici 31 1900 440 300 Nat osp2 F 587 TRUE FALSE FALSE
33 11 NonFumatrici 43 3400 475 360 Nat osp1 M 1130 TRUE FALSE FALSE
21 1 Fumatrici 40 3550 510 335 Ces osp3 M 1413 TRUE FALSE FALSE
32 6 NonFumatrici 39 3290 480 360 Ces osp2 M 1409 TRUE FALSE FALSE
27 0 NonFumatrici 31 1800 430 308 Ces osp3 M 1684 TRUE FALSE FALSE
23 0 Fumatrici 39 3100 500 325 Nat osp2 M 1367 TRUE FALSE FALSE
27 1 Fumatrici 37 3110 485 348 Nat osp2 F 182 TRUE FALSE FALSE
32 0 Fumatrici 40 3510 500 346 Nat osp2 M 1281 TRUE FALSE FALSE
30 1 NonFumatrici 33 1770 410 275 Nat osp3 M 1091 TRUE FALSE FALSE
36 6 NonFumatrici 40 4340 515 383 Nat osp1 M 1320 TRUE FALSE FALSE
20 0 NonFumatrici 41 2280 450 280 Ces osp3 M 151 TRUE FALSE FALSE
25 0 Fumatrici 41 3640 510 350 Nat osp3 M 572 TRUE FALSE FALSE
27 0 Fumatrici 40 3870 530 340 Nat osp1 M 668 TRUE FALSE FALSE
24 1 Fumatrici 39 2830 485 315 Ces osp1 M 1017 TRUE FALSE FALSE
26 1 Fumatrici 37 3140 490 335 Nat osp2 F 2236 TRUE FALSE FALSE
36 4 Fumatrici 40 3480 495 347 Nat osp1 M 1226 TRUE FALSE FALSE
33 6 NonFumatrici 37 2900 465 350 Ces osp2 F 2146 TRUE FALSE FALSE
36 2 Fumatrici 40 3040 470 345 Nat osp1 M 2463 TRUE FALSE FALSE
28 2 Fumatrici 41 3860 520 353 Nat osp2 M 1174 TRUE FALSE FALSE
32 1 Fumatrici 40 3170 500 330 Ces osp2 M 593 TRUE FALSE FALSE
31 0 NonFumatrici 31 1730 430 300 Nat osp3 F 2456 TRUE FALSE FALSE
33 0 Fumatrici 40 2940 470 352 Nat osp2 F 1408 TRUE FALSE FALSE
31 1 Fumatrici 39 4020 530 356 Nat osp3 F 1535 TRUE FALSE FALSE
24 5 NonFumatrici 42 3600 510 335 Ces osp3 M 985 TRUE FALSE FALSE
28 1 Fumatrici 40 3270 480 363 Nat osp1 M 538 TRUE FALSE FALSE
29 1 Fumatrici 39 3110 500 327 Nat osp1 F 2235 TRUE FALSE FALSE
35 9 NonFumatrici 37 3150 490 335 Nat osp2 M 1779 TRUE FALSE FALSE
33 1 Fumatrici 40 3860 525 346 Ces osp1 M 2448 TRUE FALSE FALSE
25 1 Fumatrici 37 2740 475 321 Nat osp2 F 1787 TRUE FALSE FALSE
21 0 Fumatrici 40 3050 495 322 Ces osp3 F 1510 TRUE FALSE FALSE
38 2 Fumatrici 38 2970 465 349 Nat osp3 M 1478 TRUE FALSE FALSE
19 0 Fumatrici 40 3320 500 335 Nat osp1 M 279 TRUE FALSE FALSE
24 1 Fumatrici 38 2850 465 340 Ces osp2 F 1471 TRUE FALSE FALSE
23 0 Fumatrici 38 3310 495 340 Ces osp3 M 306 TRUE FALSE FALSE
34 3 NonFumatrici 32 1615 390 297 Nat osp3 F 947 TRUE FALSE FALSE
20 0 Fumatrici 39 2920 490 320 Nat osp2 F 699 TRUE FALSE FALSE
36 8 NonFumatrici 39 3610 500 351 Ces osp1 M 89 TRUE FALSE FALSE
24 0 Fumatrici 40 3130 490 342 Ces osp2 F 2254 TRUE FALSE FALSE
33 1 Fumatrici 37 2570 460 320 Nat osp2 F 1378 TRUE FALSE FALSE
32 1 Fumatrici 37 3040 490 330 Nat osp2 F 1447 TRUE FALSE FALSE
25 3 Fumatrici 40 3250 495 334 Nat osp1 M 2268 TRUE FALSE FALSE
26 1 Fumatrici 39 3080 485 340 Nat osp2 F 1270 TRUE FALSE FALSE
20 0 Fumatrici 36 2650 470 312 Ces osp3 M 1927 TRUE FALSE FALSE
21 1 Fumatrici 39 3250 500 336 Nat osp2 F 1423 TRUE FALSE FALSE
34 2 Fumatrici 40 3380 485 361 Nat osp2 M 2099 TRUE FALSE FALSE
#Plot for diagnosis of detected values
plot_diagnosis <- function(data, xvar, yvar = "Peso", groupvar = NULL) {
  p <- ggplot(data, aes_string(x = xvar, y = yvar)) +
    geom_point(alpha = 0.5) +
    geom_point(data=filter(data, HighLeverage), color="red", size=3) +
    geom_point(data=filter(data, Outlier), shape=1, color="blue", size=3) +
    geom_point(data=filter(data, Influential), shape=4, color="black", size=3) +
    geom_smooth(method = "loess")+
    theme_classic()
    
  if (!is.null(groupvar)) {
    p <- p + facet_wrap(as.formula(paste0("~", groupvar)))
  }
  
  p <- p + 
    theme_minimal() +
    labs(
      title = paste("Influential points highlited on", yvar, "vs", xvar),
      subtitle = "Red: HighLeverage, Blue: Outlier, Black: Influential",
      x = xvar,
      y = yvar
    )
  
  return(p)
}

vars_to_plot <- c("Gestazione", "Lunghezza", "N.gravidanze", "Cranio")

plots_list <- lapply(vars_to_plot, function(var) {
  # Add grouping  only if variables is Gestazione or Lunghezza
  if (var %in% c("Gestazione", "Lunghezza")) {
    group_var <- "Fumatrici"
  } else {
    group_var <- "Sesso"
  }
  plot_diagnosis(final_influential, xvar = var, yvar = "Peso", groupvar = group_var)
})

#Plots for each regressor
print(plots_list[[1]])

print(plots_list[[2]])

print(plots_list[[3]])

print(plots_list[[4]])

#Plot of residuals vs fitted
df_resid <- data.frame(
  Fitted = fitted(mod7_cent),
  Residuals = resid(mod7_cent),
  StdResid = rstandard(mod7_cent),
  Index = 1:length(resid(mod7_cent))
)

#Residuals vs fitted
ggplot(df_resid, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha=0.5) +
  geom_hline(yintercept = 0, linetype="dashed") +
  geom_smooth(method = "loess", se=FALSE, color="red") +
  labs(title = "Residuals vs predicted values", x = "Predicted values", y = "Residuals") +
  theme_classic()

#Calculation of R^2 and RMSE of the best model
R_squared <- summary(mod7_cent)$r.squared
R_squared <- round(R_squared, 4)

#Calculation of RMSE of the best model
predicted <- predict(mod7_cent)
actual <- log(data$Peso)
RMSE <- sqrt(mean((actual - predicted)^2))
RMSE <- round(RMSE, 4)

#Table of R^2 and RMSE of the best model
df_res <- data.frame(R_squared, RMSE)
knitr::kable(df_res, caption = "Table of R^2 and RMSE of the best model")
Table of R^2 and RMSE of the best model
R_squared RMSE
0.7903 0.0839

Key observations from the diagnostic evaluation of the final model:

Prediction and results

This section demonstrates how to use the model to make predictions of the newborn’s weight based on the model parameters, highlighting how the model performs inference to find the best estimate for the target variable.

#Prediction with some example values
predict_weight <- function(N_gravidanze, Fumatrici, Gestazione, Lunghezza, Cranio, Sesso, model, mean_gest, mean_lung) {
  
  #Control of factorial levels
  if(!Fumatrici %in% levels(model$model$Fumatrici)) {
    stop("Invalid value for Fumatrici")
  }
  if(!Sesso %in% unique(model$model$Sesso)) {
    stop("Invalid value for Sesso")
  }
  
  #Centering of variable
  Gestazione_c <- as.double(Gestazione - mean_gest)
  Gestazione_c2 <- as.double(Gestazione_c^2)
  Lunghezza_c <- as.double(Lunghezza - mean_lung)
  
  #Creation of matrix 1x1 as the same class of variables in model
  Gestazione_c <- matrix(Gestazione_c, nrow = 1, ncol = 1)
  Gestazione_c2 <- matrix(Gestazione_c2, nrow = 1, ncol = 1)
  Lunghezza_c <- matrix(Lunghezza_c, nrow = 1, ncol = 1)
  
  #New data
  newdata <- data.frame(
    N.gravidanze = as.numeric(N_gravidanze),
    Fumatrici = factor(Fumatrici, levels = levels(model$model$Fumatrici)),
    Cranio = as.numeric(Cranio),
    Sesso = as.character(Sesso)
  )
  
  #Add matrix for variables Gestazione_c, Gestazione_c2 and Lunghezza_c
  newdata$Gestazione_c <- I(Gestazione_c)
  newdata$Gestazione_c2 <- I(Gestazione_c2)
  newdata$Lunghezza_c <- I(Lunghezza_c)
  
  #Logaritmic prediction
  log_pred <- predict(model, newdata = newdata)
  
  #Return of transformed prediction
  weight_pred <- exp(log_pred)
  
  return(weight_pred)
}

#Mean calculation out of the function for centering variables
mean_gest <- mean(data$Gestazione, na.rm = TRUE)
mean_lung <- mean(data$Lunghezza, na.rm = TRUE)

predicted_weight <- predict_weight(
  N_gravidanze = 3,
  Fumatrici = "NonFumatrici",
  Gestazione = 39,
  Lunghezza = 490, 
  Cranio = 337,
  Sesso = "F",
  model = mod7_cent,
  mean_gest = mean_gest,
  mean_lung = mean_lung
)

Predicted birth weight: 3173 grams

The provided function can be used to generate different prediction examples by modifying the variable values in each call. In this case, considering a mother in her third pregnancy, non-smoker, with 39 weeks of gestation and a female newborn measuring 490 mm in length and 337 mm in head circumference, the model estimates a birth weight of approximately 3200 g.

No external validation or formal cross-validation of the predictive model has been conducted. However, these practices represent a fundamental step to evaluate the generalizability and robustness of the model on independent data or in different clinical settings. Their implementation would allow confirmation of the reliability of the estimates in different populations and further enhance confidence in the clinical use of the model.

Future analyses would be even more reliable by including external validation on independent datasets or cross-validation to assess and, if necessary, optimize predictive performance with a view to transferability and broader practical application.

Visualizations

This section presents the main results of the model, interpreted within the relevant clinical context, visualizing:

#Function to plot relationship with regressors and target variable
create_plots <- function(predict_func, model, mean_gest, mean_lung, fixed_vals_list) {
  
  #Create levels for categorical variables
  fumatrici_levels <- levels(model$model$Fumatrici)
  sesso_levels <- levels(model$model$Sesso)
  
  #Plot 1: Peso vs Gestazione and Fumatrici
  gest_seq <- seq(25, 45, 0.5)
  
  df1 <- expand.grid(Gestazione = gest_seq, Fumatrici = fumatrici_levels)
  df1$Predicted <- mapply(function(g,f) {
    predict_func(N_gravidanze = fixed_vals_list$p1$N_gravidanze,
                 Fumatrici = f,
                 Gestazione = g,
                 Lunghezza = fixed_vals_list$p1$Lunghezza,
                 Cranio = fixed_vals_list$p1$Cranio,
                 Sesso = fixed_vals_list$p1$Sesso,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df1$Gestazione, df1$Fumatrici)
  
  p1 <- ggplot(df1, aes(x=Gestazione, y=Predicted, color=Fumatrici)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs Gestazione and Fumatrici") +
    theme_classic()
  
  #Plot 2: Peso vs Lunghezza and Fumatrici
  lung_seq <- seq(300, 560, 0.5)
  df2 <- expand.grid(Lunghezza = lung_seq, Fumatrici = fumatrici_levels)
  df2$Predicted <- mapply(function(l,f) {
    pred <- predict_func(N_gravidanze = fixed_vals_list$p2$N_gravidanze,
                 Fumatrici = f,
                 Gestazione = fixed_vals_list$p2$Gestazione,
                 Lunghezza = l,
                 Cranio = fixed_vals_list$p2$Cranio,
                 Sesso = fixed_vals_list$p2$Sesso,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df2$Lunghezza, df2$Fumatrici)
  
  p2 <- ggplot(df2, aes(x=Lunghezza, y=Predicted, color=Fumatrici)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred.Weight vs Lunghezza and Fumatrici") +
    theme_classic()
 
  #Plot 3: Peso vs N.Gravidanze and Sesso
  ng_seq <- seq(0, 10, 1)
  df3 <- expand.grid(N_gravidanze = ng_seq, Sesso = sesso_levels)
  df3$Predicted <- mapply(function(ng,s) {
    predict_func(N_gravidanze = ng,
                 Fumatrici = fixed_vals_list$p3$Fumatrici,
                 Gestazione = fixed_vals_list$p3$Gestazione,
                 Lunghezza = fixed_vals_list$p3$Lunghezza,
                 Cranio = fixed_vals_list$p3$Cranio,
                 Sesso = s,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df3$N_gravidanze, df3$Sesso)
  
  p3 <- ggplot(df3, aes(x=N_gravidanze, y=Predicted, color=Sesso)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs N.gravidanze and Sesso") +
    theme_classic()
  
  #Plot 4: Peso vs Cranio and Sesso
  c_seq <- seq(230, 400, 0.5)
  df4 <- expand.grid(Cranio = c_seq, Sesso = sesso_levels)
  df4$Predicted <- mapply(function(c,s) {
    predict_func(N_gravidanze = fixed_vals_list$p4$N_gravidanze,
                 Fumatrici = fixed_vals_list$p4$Fumatrici,
                 Gestazione = fixed_vals_list$p4$Gestazione,
                 Lunghezza = fixed_vals_list$p4$Lunghezza,
                 Cranio = c,
                 Sesso = s,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df4$Cranio, df4$Sesso)
  
  p4 <- ggplot(df4, aes(x=Cranio, y=Predicted, color=Sesso)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs Cranio and Sesso") +
    theme_classic()
  
  return(list(p1, p2, p3, p4))
}

#Creation of fixed vals list
fixed_vals_list <- list(
  p1 = list(N_gravidanze=3, Lunghezza=490, Cranio=337, Sesso="F"),
  p2 = list(N_gravidanze=3, Gestazione=39, Cranio=337, Sesso="F"),
  p3 = list(Fumatrici="NonFumatrici", Gestazione=39, Lunghezza=490, Cranio=337),
  p4 = list(N_gravidanze=3, Fumatrici="NonFumatrici", Gestazione=39, Lunghezza=490)
)

#Execution and visualization
plots <- create_plots(predict_weight, mod7_cent, mean_gest, mean_lung, fixed_vals_list)

#Creation of wrap plots
wrap_plots(plots, ncol=2, nrow=2)

The results obtained from the predictive model clearly highlight the most influential maternal and biometric variables affecting newborn weight, providing useful tools for early evaluation of the risk of impaired growth.

Impact of gestation and smoking:

Gestational age is the primary factor determining the progressive increase in fetal weight. However, the distinction between smoking and non-smoking mothers reveals a clear weight gap, with infants of smoking mothers being significantly smaller. This finding confirms the need for targeted preventive actions aimed at smoking cessation during pregnancy to reduce the incidence of intrauterine growth restriction. Moreover, for high-risk cases (e.g., smoking mothers with advanced gestation but low predicted weight), it is essential to schedule closer monitoring and consider early availability of neonatal intensive care.

Role of biometric measurements:

Variables such as length and head circumference represent reliable indicators of overall fetal growth. The linear or near-linear correlations with weight suggest that, when integrated into routine ultrasounds, they can be used to develop personalized weight estimates at term. Such information allows identification of infants potentially at risk of underdevelopment or macrosomia, enabling optimization of clinical decisions regarding delivery and neonatal care.

Influence of maternal factors and sex:

The number of previous pregnancies shows a positive impact on weight, likely due to progressive maternal physiological adaptations. The differentiation by fetal sex, with males generally heavier, underscores the need to consider these characteristics when calculating personalized growth curves. These findings support a differentiated risk assessment for male and female newborns, aiding specific care protocols.

Clinical and operational implications:

#Plot of predicted vs real values of target variable

#Predicted values on original scale (exp of log prediction)
predicted_values <- exp(predict(mod7_cent))

#Create dataframe with index and both observed and predicted weights
df_long <- data.frame(
  Index = 1:nrow(data),
  Observed = data$Peso,
  Predicted = predicted_values
) %>%
  pivot_longer(cols = c("Observed", "Predicted"), 
               names_to = "Type", values_to = "Weight")

#Plot of Weight vs Index, colored by Type (Observed or Predicted)
ggplot(df_long, aes(x = Index, y = Weight, color = Type)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(
    values = c("Observed" = "blue", "Predicted" = "red"),
    labels = c("Observed", "Predicted")
  ) +
  labs(
    title = "Comparison of Observed and Predicted Newborn Weight",
    x = "Observation Index",
    y = "Weight (grams)",
    color = "Data Type"
  ) +
  theme_classic()

The plot shows a good overall correspondence between the observed values (in blue) and the predicted values (in red) from the model across the entire series of observations. It can be seen that the model explains well the variability of most newborn weights within the central range of the distribution.

For extreme values, both at the lower and upper ends of the weight distribution, a greater discrepancy between observed and predicted values is observed. This phenomenon is consistent with the presence of heteroscedasticity highlighted by diagnostic tests, that is, a variation in the dispersion of prediction errors that increases with the expected value.

Moreover, the presence of biologically plausible outliers (such as very small newborns due to prematurity or very large newborns in cases of macrosomia or multiple pregnancies) makes it more difficult for the model to provide precise estimates in these extreme ranges. These cases represent natural clinical variability and not measurement errors, and their influence contributes to greater uncertainty in the predictions.

The practical implications to consider are as follows:

#Function to plot residuals of model vs groups of categorical variables of model
std_res_plot <- function(data, x, y) {
  ggplot(data, aes(x = {{x}}, y = {{y}}, fill = {{x}})) +
    geom_boxplot() +
    labs(
      title = paste("Standardized Residuals by", deparse(substitute(x))),
      x = deparse(substitute(x)),
      y = deparse(substitute(y))
    ) +
    theme_classic()
}

#Add residuals of model
data$residuals_std <- rstandard(mod7_cent)

#Plot of results
p1 <- std_res_plot(data, Fumatrici, residuals_std)
p2 <- std_res_plot(data, Sesso, residuals_std)

p1 + p2 + plot_layout(ncol = 2)

The analysis of residuals stratified by groups confirms that the model does not introduce systematic bias related to the considered categorical variables (smoking status and sex). The distribution of prediction errors is similar across groups, with medians close to zero and comparable spread of residual variability. The presence of some outliers, evenly distributed among categories, indicates that difficult-to-predict cases are not associated with specific groups but rather reflect natural biological variability. These results support what has been previously discussed, namely that the model performs well on most data and that larger discrepancies mainly occur in correspondence with rare and anomalous but biologically plausible values.

Conclusions

The analysis conducted with the predictive model allows for a reliable estimation of neonatal weight in most cases, providing an objective support for the early assessment of the risk of altered fetal growth. The model highlights the importance of maternal and biometric variables, such as gestational age, smoking status, infant sex, length, and head circumference, in determining the estimated weight.

The residual analysis stratified by smoking status and sex groups showed no systematic bias, confirming that the model maintains good predictive equity across these categories. However, greater uncertainty emerged in the estimation for extreme weight values, i.e., in rare but biologically plausible cases. This uncertainty requires the qualified intervention of clinical expertise, which must validate and integrate the estimates especially in cases of very low or very high predicted weight, or in the presence of significant maternal risk factors.

At the operational level, particular attention is recommended for:

In terms of model improvement, the systematic collection of longitudinal fetal growth data and the inclusion of additional relevant clinical variables could increase the predictive accuracy and robustness of the model. The collection of repeated biometric parameters over time, as well as more detailed information on maternal conditions, would facilitate the development of dynamic and personalized models, with a potential positive impact on prenatal care pathways.