Statistical model for newborn weight prediction

Project introduction

The project aims to develop a statistical model to accurately predict newborn birth weight using clinical data from three hospitals. This initiative seeks to improve the management of high-risk pregnancies, optimize hospital resources, and promote better neonatal outcomes, within a context of growing attention to the prevention of neonatal complications. Predicting neonatal birth weight is essential for optimal care planning and reducing risks associated with preterm births or low birth weight infants.

The main benefits of the project are:

Support for timely clinical interventions through more accurate weight predictions.
Optimization of resource use by early identification of newborns requiring intensive care.
Identification of risk factors affecting neonatal weight, enabling personalized management of high-risk pregnancies.
Comparative evaluation of clinical practices across hospitals to improve care quality and consistency.
Strategic and healthcare decision support to reduce neonatal mortality and morbidity.

#Load dataset
data <- read.csv("C:/Users/matmi/OneDrive/Desktop/ProgettiR/neonati.csv")

#First 10 rows of dataset
kable(head(data, 10), caption = "First 10 rows of dataset")

First 10 rows of dataset
Anni.madre	N.gravidanze	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
26	0	42	3380	490	325	Nat	osp3	M
21	2	39	3150	490	345	Nat	osp1	F
34	3	38	3640	500	375	Nat	osp2	M
28	1	41	3690	515	365	Nat	osp2	M
20	0	38	3700	480	335	Nat	osp3	F
32	0	40	3200	495	340	Nat	osp2	F
26	1	39	3100	480	345	Nat	osp3	F
25	0	40	3580	510	349	Nat	osp1	M
22	1	40	3670	500	335	Ces	osp2	F
23	0	41	3700	510	362	Ces	osp2	F

#Summary of dataset
kable(summary(data), caption = "Statistical index summary of dataset")

Statistical index summary of dataset
Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
Min. : 0.00	Min. : 0.0000	Min. :0.0000	Min. :25.00	Min. : 830	Min. :310.0	Min. :235	Length:2500	Length:2500	Length:2500
1st Qu.:25.00	1st Qu.: 0.0000	1st Qu.:0.0000	1st Qu.:38.00	1st Qu.:2990	1st Qu.:480.0	1st Qu.:330	Class :character	Class :character	Class :character
Median :28.00	Median : 1.0000	Median :0.0000	Median :39.00	Median :3300	Median :500.0	Median :340	Mode :character	Mode :character	Mode :character
Mean :28.16	Mean : 0.9812	Mean :0.0416	Mean :38.98	Mean :3284	Mean :494.7	Mean :340	NA	NA	NA
3rd Qu.:32.00	3rd Qu.: 1.0000	3rd Qu.:0.0000	3rd Qu.:40.00	3rd Qu.:3620	3rd Qu.:510.0	3rd Qu.:350	NA	NA	NA
Max. :46.00	Max. :12.0000	Max. :1.0000	Max. :43.00	Max. :4930	Max. :565.0	Max. :390	NA	NA	NA

attach(data)

To build the predictive model, data were collected on 2,500 newborns from three hospitals. The collected variables include:

Mother’s age: Measured in years.
Number of pregnancies: The total number of pregnancies the mother has had.
Maternal smoking: A binary indicator (0 = non-smoker, 1 = smoker).
Duration of pregnancy: Number of weeks of gestation.
Newborn weight: Birth weight in grams.
Length and head diameter: Newborn length and cranial diameter, which can also be measured during pregnancy via ultrasounds.
Type of delivery: Natural or cesarean.
Birth hospital: Hospital 1, 2, or 3.
Newborn sex: Male (M) or female (F).

The dataset shows the following key characteristics:

Average maternal age of 28.16 years, with anomalous values of 0 that require checking, and the majority between 25 and 32 years, a typical reproductive age range.
Average number of pregnancies around 1, with a rare extreme case of 12 pregnancies that needs verification.
Average gestation duration of about 39 weeks, with a minimum of 25 weeks indicating prematurity.
Average newborn weight of approximately 3,284 grams, with a wide range from 830 grams (premature or low birth weight) to 4,930 grams (larger newborns).
Average length of about 495 mm and average head circumference of 340 mm, with natural variation within the sample.

Descriptive analysis of dataset

The dataset analysis is structured into the following main phases:

Calculation of key statistical indicators for each numerical variable (mean, median, standard deviation, skewness, and kurtosis).
Assessment of normality using the Shapiro-Wilk test, accompanied by visual analyses to evaluate its reliability in large samples.
Graphical comparison of variable distributions with the corresponding theoretical normal distribution.
Q-Q plot analysis to evaluate the alignment between sample and theoretical quantiles.
Identification and analysis of outliers using boxplots with upper and lower bounds.
Examination of numerical variables in relation to categorical variables in the dataset.
Study of correlations among variables.

Statistical index calculation

#Function for statistical index calculation of each variable
calc_stats <- function(x) {
  c(
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    skewness = moments::skewness(x, na.rm = TRUE),
    kurtosis = moments::kurtosis(x, na.rm = TRUE)
  )
}

#Matrix of statistical index for each numeric variable of dataset
numeric_vars <- sapply(data, is.numeric)
numeric_vars <- numeric_vars & sapply(data, function(x) length(unique(na.omit(x))) > 2)
stat_summary <- sapply(data[, numeric_vars], calc_stats)
kable(stat_summary, caption = "Matrix of statistical index for each numeric variable of dataset")

Matrix of statistical index for each numeric variable of dataset
	Anni.madre	N.gravidanze	Gestazione	Peso	Lunghezza	Cranio
mean	28.1640000	0.981200	38.980400	3284.0808000	494.692000	340.0292000
median	28.0000000	1.000000	39.000000	3300.0000000	500.000000	340.0000000
sd	5.2735783	1.280587	1.868639	525.0387442	26.318644	16.4253299
skewness	0.0428115	2.514254	-2.065313	-0.6470308	-1.514699	-0.7850527
kurtosis	3.3804165	13.989406	11.258150	5.0315318	9.487174	5.9462063

The calculated statistics show the following situation for each variable:

Mother’s age: Distribution close to normal, with a slight heavy tail; it can be initially considered normal.
Number of pregnancies: Non-normal distribution, strongly skewed with a high peak; far from normal.
Gestation duration: Non-normal distribution, left-skewed with heavy tails; cannot be considered normal.
Birth weight: Approximately normal distribution but with negative skewness and heavy tails; moderately far from normal.
Length: Non-normal distribution, skewed towards lower values with heavy tails; far from normal.
Head circumference: Slightly skewed distribution with moderate tails; can be considered approximately normal with caution.

Shapiro-Wilk test for numerical variables

This phase aims to assess whether the distributions of the numerical variables in the dataset approximate a normal distribution, a fundamental prerequisite for the use of many parametric statistical techniques, such as the t-test or linear regression. To this end, statistical tests (Shapiro-Wilk) and visual methods, such as density plots compared to the theoretical normal distribution and Q-Q plots (quantile-quantile), were employed. These allow identification of any asymmetries, heavy tails, or deviations from normality.

The results of this analysis will guide the selection of the most appropriate statistical tools and models in the subsequent phases of the project.

#Function for Shapiro-Wilk test
shapiro_summary <- function(x, alpha = 0.05) {
  x <- na.omit(x)
 
  if (length(x) < 3) {
    return(NA)
  }
  if (length(x) > 5000) {
    x <- sample(x, 5000)
  }
  test <- shapiro.test(x)
  p_value <- test$p.value
  result <- ifelse(p_value > alpha, "Fail to reject H0 (normality)", "Reject H0 (not normal)")
  
  return(c(p_value = p_value, conclusion = result))
}

shapiro_results <- lapply(data[, numeric_vars], shapiro_summary)

shapiro_df <- do.call(rbind, lapply(shapiro_results, function(x) {
  data.frame(p_value = x["p_value"], conclusion = x["conclusion"])
}))

rownames(shapiro_df) <- names(shapiro_results)

kable(shapiro_df, caption = "Results of Shapiro-Wilk test on each variable of dataset", row.names = TRUE)

Results of Shapiro-Wilk test on each variable of dataset
	p_value	conclusion
Anni.madre	1.63894746545117e-09	Reject H0 (not normal)
N.gravidanze	8.89524555042765e-54	Reject H0 (not normal)
Gestazione	2.6437822902771e-45	Reject H0 (not normal)
Peso	3.2344586185273e-22	Reject H0 (not normal)
Lunghezza	3.18605170830064e-36	Reject H0 (not normal)
Cranio	1.26291692019115e-24	Reject H0 (not normal)

The Shapiro-Wilk test indicates that no variable is perfectly normal, but in large samples even small deviations lead to rejecting the normality hypothesis. However, in large samples, parametric tests are generally robust to moderate departures from normality, especially when supported by graphical analyses (to be conducted in the following phases) and by considering the practical relevance of the deviations observed. This strategy allows making the most of the effectiveness and sensitivity of parametric methods to draw solid conclusions, while remaining attentive to potential limitations caused by more pronounced deviations.

Numerical variable distribution visual analysis

#Function for density plot
plot_density <- function(x, main_title = deparse(substitute(x))) {
  x <- na.omit(x)
  
  m <- mean(x)
  s <- sd(x)
  
  df <- data.frame(value = x)
  
  x_seq <- seq(min(x), max(x), length.out = 1000)
  norm_density <- dnorm(x_seq, mean = m, sd = s)
  
  norm_df <- data.frame(x = x_seq, y = norm_density)
  
  ggplot(df, aes(x = value)) +
    geom_density(color = "black", fill = "gray80") +
    geom_line(data = norm_df, aes(x = x, y = y), color = "red", size = 1) +
    labs(title = paste("Density plot of", main_title),
         x = "Value",
         y = "Density") +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5))
}

#Variables for plot
vars_to_plot <- data[, numeric_vars]

plots_list <- lapply(names(vars_to_plot), function(var_name) {
  plot_density(vars_to_plot[[var_name]], main_title = var_name)
})

#Plot of each variable distribution vs normal distribution with same mean and standard deviation
combined_plot <- wrap_plots(plots_list, ncol = 2, nrow = 3)
print(combined_plot)

The visual analysis of the density plots confirms the main characteristics highlighted by the statistical indicators:

Maternal age and birth weight show approximately symmetric and near-normal distributions.
Number of pregnancies has a highly skewed distribution toward low values.
Gestation duration presents a slight skewness toward lower weeks, consistent with the presence of prematurity.
Length and head circumference are largely symmetric, with only a slight skewness in length.

Numerical variable Q-Q plot analysis

#Function for Q-Q plot
qq_plot <- function(x, main_title = deparse(substitute(x))) {
  x <- na.omit(x)
  
  df <- data.frame(value = x)
  
  ggplot(df, aes(sample = value)) +
    stat_qq(color = "blue")+
    stat_qq_line(color = "red")+
    labs(title = paste("Q-Q plot of", main_title),
          x = "Theoretical quantiles",
          y = "Sample quantiles") +
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))
}

qq_list <- lapply(names(vars_to_plot), function(var_name) {
  qq_plot(vars_to_plot[[var_name]], main_title = var_name)
})

combined_plot <- wrap_plots(qq_list, ncol = 2, nrow = 3)
print(combined_plot)

The results of the Q-Q plots, which display the quantile-quantile distribution of the variables compared to theoretical normality, confirm the previous observations regarding the closeness of some variables to normality and the more marked deviations of others, especially in the tails and asymmetries.

Numerical variable outlier analysis

In this phase, an outlier analysis was performed using the interquartile range (IQR) rule to identify anomalous values that could affect the quality of the analysis. Clinically relevant data, such as cases of prematurity or multiple pregnancies, were retained, while only values deemed implausible were removed to ensure the reliability and representativeness of the sample.

#Function for boxplot
box_plot <- function(x, main_title = deparse(substitute(x))) {
  x <- na.omit(x)
  
  df <- data.frame(value = x)
  
  ggplot(df, aes(x = "", y = value)) +
    geom_boxplot(fill = "lightblue", color = "darkblue") +
    labs(title = paste("Boxplot of", main_title),
         y = main_title,
         x = "") +
    theme_classic() +
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          plot.title = element_text(hjust = 0.5))
}

box_list <- lapply(names(vars_to_plot), function(var_name) {
  box_plot(vars_to_plot[[var_name]], main_title = var_name)
})

combined_plot <- wrap_plots(box_list, ncol = 3, nrow = 2)
print(combined_plot)

#Function to calculate outliers
calc_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  which_outliers <- which(x < lower_bound | x > upper_bound)
  
  if(length(which_outliers) == 0) {
    return(NULL)
  }
  
  n_outliers <- length(which_outliers)  
  total <- sum(!is.na(x))
  percent_outliers <- 100 * n_outliers / total 
  
  conclusion <- ifelse(percent_outliers > 5, "High number of outliers", "Acceptable number of outliers")
  
  return(list(n_outliers = n_outliers, percent_outliers = percent_outliers, conclusion = conclusion))
}

outliers_list <- lapply(data[, numeric_vars], calc_outliers)
outliers_df <- bind_rows(lapply(outliers_list, as.data.frame), .id = "Variable")

kable(outliers_df, caption = "Table of number, percentage and conclusion for outliers of each numerical variable")

Table of number, percentage and conclusion for outliers of each numerical variable
Variable	n_outliers	percent_outliers	conclusion
Anni.madre	13	0.52	Acceptable number of outliers
N.gravidanze	246	9.84	High number of outliers
Gestazione	67	2.68	Acceptable number of outliers
Peso	69	2.76	Acceptable number of outliers
Lunghezza	59	2.36	Acceptable number of outliers
Cranio	48	1.92	Acceptable number of outliers

The outlier analysis shows that the number of pregnancies has a significant proportion of extreme values (9.84%) that could influence the model, while mother’s age has few outliers likely due to errors that need correction. The other variables exhibit few anomalies, mainly attributable to clinically valid data (e.g., prematurity), which should be retained. Overall, the data are of good quality and suitable for modeling, with particular attention to managing outliers in “Number of pregnancies” and “Mother’s age.”

#Function to find and remove error in variable Anni.Madre
remove_rows_condition <- function(data, var, condition_func) {
  rows_to_remove <- which(condition_func(data[[var]]))
  data_filtered <- data[-rows_to_remove, ]
  return(data_filtered)
}

data <- remove_rows_condition(data, "Anni.madre", function(x) x < 10)

#Summary of dataset after removing error in variable Anni.Madre
kable(summary(data), caption = "Summary of dataset after removing error in variable Anni.Madre")

Summary of dataset after removing error in variable Anni.Madre
Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
Min. :13.00	Min. : 0.0000	Min. :0.00000	Min. :25.00	Min. : 830	Min. :310.0	Min. :235	Length:2498	Length:2498	Length:2498
1st Qu.:25.00	1st Qu.: 0.0000	1st Qu.:0.00000	1st Qu.:38.00	1st Qu.:2990	1st Qu.:480.0	1st Qu.:330	Class :character	Class :character	Class :character
Median :28.00	Median : 1.0000	Median :0.00000	Median :39.00	Median :3300	Median :500.0	Median :340	Mode :character	Mode :character	Mode :character
Mean :28.19	Mean : 0.9816	Mean :0.04163	Mean :38.98	Mean :3284	Mean :494.7	Mean :340	NA	NA	NA
3rd Qu.:32.00	3rd Qu.: 1.0000	3rd Qu.:0.00000	3rd Qu.:40.00	3rd Qu.:3620	3rd Qu.:510.0	3rd Qu.:350	NA	NA	NA
Max. :46.00	Max. :12.0000	Max. :1.00000	Max. :43.00	Max. :4930	Max. :565.0	Max. :390	NA	NA	NA

An exploratory analysis was conducted by identifying and removing evident errors, such as maternal ages below biologically plausible values, eliminating only a few observations without compromising the overall quality of the dataset. Outliers related to clinical variables such as weight, length, head circumference, and gestation duration were deemed plausible and therefore retained, as were the rare high values in the “N.gravidanze” variable. However, considering the potential influence of these extreme values on the model, it might be useful to further investigate the analysis by applying robust statistical methods or variable transformations in the future, while carefully assessing whether these approaches lead to meaningful improvements without overly complicating the model’s interpretability and parsimony.

Multivariate analysis of dataset and statistical tests

Below, a multivariate analysis of the dataset variables is conducted. This analysis aims to examine the distributions of numerical variables in relation to the corresponding levels of categorical variables, in order to identify differences between groups, through:

Calculation of the means of each numerical variable for all groups, along with their differences.
Statistical tests and boxplot visualizations to determine, for each numerical variable, whether the distributions differ significantly among groups.
Focus on specific considerations, such as: assessing if there is a difference in the number of cesarean deliveries among hospitals, verifying whether the mean birth weight and length significantly differ from those of the general population, and checking if anthropometric measurements differ significantly between sexes.

Mean calculation beetwen groups and relative differences

#Function to calculate mean aggregated by other variables
calc_aggregate <- function(data, num_var, cat_var) {
  agg <- data %>%
    group_by(across(all_of(cat_var))) %>%
    summarise(
      mean = mean(.data[[num_var]], na.rm = TRUE),
      .groups = "drop"
    )%>%
    pivot_longer(cols = mean, names_to = "Stat", values_to = "Value") %>%
    pivot_wider(names_from = !!sym(cat_var), values_from = Value) %>%
    mutate(Variable = num_var, .before = 1)
  
  #Extract only mean
  means_row <- agg %>% dplyr::select(-Variable, -Stat)
  
  group_names <- colnames(means_row)
  means_values <- as.numeric(means_row[1, ])
  
  #Function to calculate mean differences between pairs of values for each categorical variable
  comb_pairs <- combn(seq_along(means_values), 2, simplify = FALSE)
  
  diff_cols <- map(comb_pairs, function(idx) {
      i <- idx[1]
      j <- idx[2]
      diff_val <- means_values[j] - means_values[i]
      col_name <- paste0("diff_", group_names[j], "_minus_", group_names[i])
      
      diff_col <- rep(NA_real_, nrow(agg))
      media_row_index <- which(agg$Stat == "mean")
      diff_col[media_row_index] <- diff_val
    
      list(name = col_name, values = diff_col)
  })
  
  for (c in diff_cols) {
    agg[[c$name]] <- c$values
  }
  
  return(agg)
}

#Definition of categorical variables
categorical_vars <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")
numerical_vars <- names(data)[numeric_vars]

results_list <- list()

#For statement to save all dataset (aggregated index by each categorical variable)
for (cat_var in categorical_vars) {
  agg_tables <- lapply(numerical_vars, function(num_var) {
    df_agg <- calc_aggregate(data, num_var, cat_var)
  })
  results_list[[cat_var]] <- bind_rows(agg_tables)
}

#Print of each calculated table
kable(results_list[["Fumatrici"]], caption = "Aggregated stats by Fumatrici")

Aggregated stats by Fumatrici
Variable	Stat	0	1	diff_1_minus_0
Anni.madre	mean	28.1804511	28.317308	0.1368566
N.gravidanze	mean	0.9690894	1.269231	0.3001414
Gestazione	mean	38.9670008	39.269231	0.3022299
Peso	mean	3286.2623225	3236.346154	-49.9161686
Lunghezza	mean	494.8099415	492.067308	-2.7426338
Cranio	mean	340.0588972	339.346154	-0.7127434

kable(results_list[["Tipo.parto"]], caption = "Aggregated stats by Tipo.parto")

Aggregated stats by Tipo.parto
Variable	Stat	Ces	Nat	diff_Nat_minus_Ces
Anni.madre	mean	28.201923	28.1796610	-0.0222621
N.gravidanze	mean	1.023352	0.9644068	-0.0589449
Gestazione	mean	39.023352	38.9615819	-0.0617697
Peso	mean	3282.046703	3285.0632768	3.0165735
Lunghezza	mean	496.365385	494.0090395	-2.3563451
Cranio	mean	340.012363	340.0361582	0.0237956

kable(results_list[["Ospedale"]], caption = "Aggregated stats by Ospedale")

Aggregated stats by Ospedale
Variable	Stat	osp1	osp2	osp3	diff_osp2_minus_osp1	diff_osp3_minus_osp1	diff_osp3_minus_osp2
Anni.madre	mean	28.045343	28.1037736	28.407674	0.0584304	0.3623307	0.3039003
N.gravidanze	mean	0.995098	0.9245283	1.026379	-0.0705697	0.0312809	0.1018506
Gestazione	mean	38.944853	38.9693396	39.023981	0.0244867	0.0791279	0.0546412
Peso	mean	3270.265931	3270.5070755	3311.708633	0.2411441	41.4427017	41.2015576
Lunghezza	mean	494.120098	495.3396226	494.604316	1.2195246	0.4842185	-0.7353061
Cranio	mean	339.927696	339.8856132	340.274580	-0.0420829	0.3468843	0.3889671

kable(results_list[["Sesso"]], caption = "Aggregated stats by Sesso")

Aggregated stats by Sesso
Variable	Stat	F	M	diff_M_minus_F
Anni.madre	mean	28.1370518	28.235720	0.0986682
N.gravidanze	mean	0.9498008	1.013677	0.0638758
Gestazione	mean	38.7314741	39.230089	0.4986144
Peso	mean	3161.0613546	3408.495575	247.4342206
Lunghezza	mean	489.7641434	499.674980	9.9108365
Cranio	mean	337.6231076	342.458568	4.8354604

Statistical tests to investigate differences beetwen groups (focus on Peso and Lunghezza beetwen Sesso)

#Function to explore differences beetwen groups for each numerical variable
data$Fumatrici <- factor(data$Fumatrici, levels = c(0, 1), labels = c("NonFumatrici", "Fumatrici"))
categorical_for_test <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")

test_group_diff <- function(data, num_var, cat_var) {
  x <- data[[num_var]]
  g <- data[[cat_var]]
  
  df <- data.frame(x = x, g = g) %>% filter(!is.na(x), !is.na(g))
  
  n_groups <- length(unique(df$g))
  
  #Normality test for each group
  normality <- by(df$x, df$g, function(v) shapiro.test(v)$p.value)
  normal <- all(unlist(normality) > 0.05)
  
  #Omogeneity variance test (Levene)
  levene_p <- car::leveneTest(x ~ g, data = df)$"Pr(>F)"[1]
  variance_homogeneous <- (levene_p > 0.05)
  
  if (n_groups == 2) {
    if (normal && variance_homogeneous) {
      test <- t.test(x ~ g, data = df, var.equal = TRUE)
      test_name <- "t-test (equal variance)"
      nparam_test_name <- NA_character_
      nparam_pvalue <- NA_real_
    } else {
      test <- t.test(x ~ g, data = df, var.equal = FALSE)
      test_name <- "Welch t-test (different variance)"
      nparam_test_name <- "Wilcoxon rank-sum test"
      wtest <- wilcox.test(x ~ g, data = df)
      nparam_pvalue <- wtest$p.value
    }
  } else if (n_groups > 2) {
    if (normal && variance_homogeneous) {
      test <- aov(x ~ g, data = df)
      test_name <- "ANOVA (equal variance)"
      nparam_test_name <- NA_character_
      nparam_pvalue <- NA_real_
    } else {
      test <- oneway.test(x ~ g, data = df, var.equal = FALSE)
      test_name <- "Welch ANOVA (different variance)"
      nparam_test_name <- "Kruskal-Wallis test"
      kwtest <- kruskal.test(x ~ g, data = df)
      nparam_pvalue <- kwtest$p.value
    }
  } else {
    return(NA)
  }
  
  if (test_name == "ANOVA (equal variance)") {
    param_pvalue <- summary(test)[[1]][["Pr(>F)"]][1]
  } else {
    param_pvalue <- test$p.value
  }
  
  return(list(
    test = test_name,
    nparam_test = nparam_test_name,
    param_pvalue = param_pvalue,
    nparam_pvalue = nparam_pvalue
  ))
}

#Return of test results
results_tests <- expand.grid(num_var = numerical_vars, cat_var = categorical_vars, stringsAsFactors = FALSE)

results_tests <- results_tests %>%
  rowwise() %>%
  mutate(test_result = list(test_group_diff(data, num_var, cat_var))) %>%
  mutate(
  test = test_result$test,
  param_pvalue = test_result$param_pvalue,
  param_significance = case_when(
    param_pvalue < 0.001 ~ "***",
    param_pvalue < 0.01  ~ "**",
    param_pvalue < 0.05  ~ "*",
    TRUE                 ~ ""
  ),
  nparam_test = test_result$nparam_test,
  nparam_pvalue = test_result$nparam_pvalue,
  nparam_significance = case_when(
    nparam_pvalue < 0.001 ~ "***",
    nparam_pvalue < 0.01  ~ "**",
    nparam_pvalue < 0.05  ~ "*",
    TRUE                  ~ ""
    ),
  p_value_concordance = ifelse((param_pvalue < 0.05 & nparam_pvalue < 0.05) |
         (param_pvalue >= 0.05 & nparam_pvalue >= 0.05), "P-value concordance", "P-value discordance")
  ) %>%
  ungroup() %>%
  dplyr::select(num_var, cat_var, test, param_pvalue, param_significance, nparam_test, nparam_pvalue, nparam_significance, p_value_concordance)

#Print of results
kable(
  results_tests,
  caption = "Statistical test results for differences beetwen groups",
  col.names = c("Numerical variable", "Categorical variable", "Parametric test", "P-value pametric test", "Significance parametric test", "Not parametric test", "P-value not parametric test", "Significance not parametric test", "P-Value concordance")
)

Statistical test results for differences beetwen groups
Numerical variable	Categorical variable	Parametric test	P-value pametric test	Significance parametric test	Not parametric test	P-value not parametric test	Significance not parametric test	P-Value concordance
Anni.madre	Fumatrici	Welch t-test (different variance)	0.8016453		Wilcoxon rank-sum test	0.8716883		P-value concordance
N.gravidanze	Fumatrici	Welch t-test (different variance)	0.0188038	*	Wilcoxon rank-sum test	0.0022427	**	P-value concordance
Gestazione	Fumatrici	Welch t-test (different variance)	0.0388965	*	Wilcoxon rank-sum test	0.2098755		P-value discordance
Peso	Fumatrici	Welch t-test (different variance)	0.3022785		Wilcoxon rank-sum test	0.0592839		P-value concordance
Lunghezza	Fumatrici	Welch t-test (different variance)	0.2100567		Wilcoxon rank-sum test	0.0373755	*	P-value discordance
Cranio	Fumatrici	Welch t-test (different variance)	0.6149153		Wilcoxon rank-sum test	0.3392392		P-value concordance
Anni.madre	Tipo.parto	Welch t-test (different variance)	0.9240376		Wilcoxon rank-sum test	0.9481223		P-value concordance
N.gravidanze	Tipo.parto	Welch t-test (different variance)	0.2881163		Wilcoxon rank-sum test	0.1196188		P-value concordance
Gestazione	Tipo.parto	Welch t-test (different variance)	0.4343654		Wilcoxon rank-sum test	0.9322109		P-value concordance
Peso	Tipo.parto	Welch t-test (different variance)	0.8916349		Wilcoxon rank-sum test	0.5243601		P-value concordance
Lunghezza	Tipo.parto	Welch t-test (different variance)	0.0326031	*	Wilcoxon rank-sum test	0.0879934		P-value discordance
Cranio	Tipo.parto	Welch t-test (different variance)	0.9728416		Wilcoxon rank-sum test	0.6432892		P-value concordance
Anni.madre	Ospedale	Welch ANOVA (different variance)	0.3091713		Kruskal-Wallis test	0.2952401		P-value concordance
N.gravidanze	Ospedale	Welch ANOVA (different variance)	0.2472083		Kruskal-Wallis test	0.0661108		P-value concordance
Gestazione	Ospedale	Welch ANOVA (different variance)	0.6924804		Kruskal-Wallis test	0.2977338		P-value concordance
Peso	Ospedale	Welch ANOVA (different variance)	0.1749530		Kruskal-Wallis test	0.2164421		P-value concordance
Lunghezza	Ospedale	Welch ANOVA (different variance)	0.6356001		Kruskal-Wallis test	0.7351307		P-value concordance
Cranio	Ospedale	Welch ANOVA (different variance)	0.8647781		Kruskal-Wallis test	0.9283076		P-value concordance
Anni.madre	Sesso	Welch t-test (different variance)	0.6366427		Wilcoxon rank-sum test	0.7286129		P-value concordance
N.gravidanze	Sesso	Welch t-test (different variance)	0.2129632		Wilcoxon rank-sum test	0.2907222		P-value concordance
Gestazione	Sesso	Welch t-test (different variance)	0.0000000	***	Wilcoxon rank-sum test	0.0000000	***	P-value concordance
Peso	Sesso	Welch t-test (different variance)	0.0000000	***	Wilcoxon rank-sum test	0.0000000	***	P-value concordance
Lunghezza	Sesso	Welch t-test (different variance)	0.0000000	***	Wilcoxon rank-sum test	0.0000000	***	P-value concordance
Cranio	Sesso	Welch t-test (different variance)	0.0000000	***	Wilcoxon rank-sum test	0.0000000	***	P-value concordance

The objective of the analysis is to identify significant differences between groups in numerical variables by selecting appropriate statistical tests based on the data distributions. Although normality tests indicate deviations, many variables show approximately normal distributions with slight skewness and outliers. For this reason, standard parametric tests are used when normality and homogeneity of variances assumptions are met, supplemented by robust tests (Welch) in cases of heteroscedasticity, and by non-parametric tests (Wilcoxon and Kruskal-Wallis) when parametric assumptions are violated.

The combined application of these methods allows for reliable detection of significant differences between groups, such as the effect of maternal smoking on the number of pregnancies and newborn size, and differences related to sex on gestational duration and anthropometric characteristics.

Specifically, as explicitly requested, it is verified that anthropometric measurements differ significantly between the two sexes. This finding is confirmed by both parametric and non-parametric tests.

These results will be further illustrated in the following boxplots.

#Function for boxplot
box_plot_grouped <- function(data, num_var, group_var, main_title = NULL) {
  df <- data %>%
    dplyr::select(all_of(c(num_var, group_var))) %>%
    filter(!is.na(.data[[num_var]]), !is.na(.data[[group_var]]))
  
  if (is.null(main_title)) {
    main_title <- paste("Boxplot of", num_var, "by", group_var)
  }
  
  ggplot(df, aes_string(x = group_var, y = num_var, fill = group_var)) +
    geom_boxplot() +
    labs(title = main_title, x = group_var, y = num_var) +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5))
}

categorical_vars <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")

#Boxplot by group of categorical variables
plots_by_group <- lapply(categorical_vars, function(cat_var) {
  p_list <- lapply(numerical_vars, function(num_var) {
    box_plot_grouped(data, num_var, cat_var)
  })
  wrap_plots(p_list, ncol = 2) 
})

names(plots_by_group) <- categorical_vars

#Plot of each boxplot
print(plots_by_group[["Fumatrici"]])

print(plots_by_group[["Tipo.parto"]])

print(plots_by_group[["Ospedale"]])

print(plots_by_group[["Sesso"]])

The visual analysis through boxplots supports the statistical results, highlighting significant differences in gestational weeks and newborn length between smokers and non-smokers, as well as pronounced differences by newborn sex in gestational age, weight, length, and head circumference. No significant differences emerge for delivery type or hospital.

Statistical test for differences in cesarean deliveries among hospitals

As a further analysis, a chi-squared test of independence is conducted to verify whether the distribution of delivery type (natural or cesarean) significantly differs among the three hospitals in the dataset. This test allows the evaluation of whether an association exists between the healthcare facility and the mode of delivery.

calc_tab_cont <- function(var1, var2) {
  cont_tab <- table(var1, var2)
  return(cont_tab)
}

cont_tab <- calc_tab_cont(data$Tipo.parto, data$Ospedale)

kable(cont_tab, caption = "Contingency table for variables Tipo.parto and Ospedale")

Contingency table for variables Tipo.parto and Ospedale
	osp1	osp2	osp3
Ces	242	254	232
Nat	574	594	602

result_t1 <- chisq.test(cont_tab)
test_ces_df <- data.frame(
  Method = result_t1$method,
  Statistic = round(as.numeric(result_t1$statistic), 4),
  P_Value = round(result_t1$p.value, 4)
)

kable(test_ces_df, caption = "Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale")

Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale
Method	Statistic	P_Value
Pearson’s Chi-squared test	1.083	0.5819

The statistical analysis using the chi-squared test on the contingency table between ‘Tipo.parto’ and ‘Ospedale’ revealed a p-value of 0.578, which is above the conventional significance threshold of 0.05. This indicates that there are no significant differences in the distribution of delivery types (natural or cesarean) among the three hospitals considered. Therefore, in the studied sample, the place of delivery does not appear to influence the frequency of cesarean births.

Statistical test for differences of Peso and Lunghezza from population

Moreover, a statistical verification is conducted to assess the difference between the sample means of newborn weight and length and their respective reference values (3300 grams for weight and 500 mm for length). These reference values are taken from the 2024 Cedap report of the Italian Ministry of Health and are derived from the data collected in the homonymous information flows that each Health Organization is required to produce monthly. To this end, two one-sample tests are employed: the Student’s t-test, based on the assumption of normality, and the Wilcoxon test, which is non-parametric and more robust to deviations from the normality assumption.

#Function to test Peso and Lunghezza then same population variable
compare_to_population <- function(x, mu, alpha = 0.05) {
  x <- na.omit(x)
  
  #T test
  t_res <- t.test(x, mu = mu)
  
  #Wilcoxon test
  wilcox_res <- wilcox.test(x, mu = mu)
  
  #Summary of results
  result <- data.frame(
      parametric_test = "One-sample t-test",
      p_value_ptest = t_res$p.value,
      conclusion_ptest = ifelse(t_res$p.value < alpha,
                          "Reject H0. Target variable is statistically different from population variable",
                          "Fail to Reject H0. Target variable is not statistically different from population variable"),
      nparametric_test = "Wilcoxon signed-rank test",
      p_value_nptest = wilcox_res$p.value,
      conclusion_nptest = ifelse(wilcox_res$p.value < alpha,
                          "Reject H0. Target variable is statistically different from population variable",
                          "Fail to Reject H0. Target variable is not statistically different from population variable"),
      stringsAsFactors = FALSE    
    )
  
    return(result)
 }

#Test for weight and length
weight_ref <- 3300
length_ref <- 500

weight_test <- compare_to_population(data$Peso, mu = weight_ref)
length_test <- compare_to_population(data$Lunghezza, mu = length_ref)

kable(weight_test, caption = "Summary table of results for compare beetwen Peso of sample and weight of population",
      col.names = c("Parametric test", "P-value parametric test", "Conclusion parametric test", "Not parametric test", "P-value not parametric test", "Conclusion not parametric test"))

Summary table of results for compare beetwen Peso of sample and weight of population
Parametric test	P-value parametric test	Conclusion parametric test	Not parametric test	P-value not parametric test	Conclusion not parametric test
One-sample t-test	0.1324476	Fail to Reject H0. Target variable is not statistically different from population variable	Wilcoxon signed-rank test	0.9482417	Fail to Reject H0. Target variable is not statistically different from population variable

kable(length_test, caption = "Summary table of results for compare beetwen Lunghezza of sample and length of population",
      col.names = c("Parametric test", "P-value parametric test", "Conclusion parametric test", "Not parametric test", "P-value not parametric test", "Conclusion not parametric test"))

Summary table of results for compare beetwen Lunghezza of sample and length of population
Parametric test	P-value parametric test	Conclusion parametric test	Not parametric test	P-value not parametric test	Conclusion not parametric test
One-sample t-test	0	Reject H0. Target variable is statistically different from population variable	Wilcoxon signed-rank test	0	Reject H0. Target variable is statistically different from population variable

Both parametric and non-parametric tests confirm that the average newborn weight in the sample does not differ significantly from the reference value of 3300 grams, indicating a good correspondence with the reference population. Conversely, for length, both tests detect a significant difference compared to the 500 mm reference, indicating that the average length in the sample deviates from the expected value. The consistency between the tests is supported by the approximate normality of the distributions, which justifies the use of parametric tests and strengthens the validity of the conclusions through the consistent results from the non-parametric tests.

Correlation analysis

In this phase, correlation analyses are performed among the dataset variables to identify any linear relationships between them. First, the global relationship among the numerical variables in the dataset is analyzed, followed by the relationship between numerical variables within each group defined by the values of the categorical variables.

#Select only numerical variables
num_data <- data[, numeric_vars]

#Correlation matrix for numerical variable of dataset
ggpairs(
  num_data,
  lower = list(continuous = wrap("smooth_loess", alpha = 0.3, size = 0.5)),
  upper = list(continuous = wrap("cor", size = 5)),
  diag = list(continuous = wrap("densityDiag"))
) +
  theme_classic() +
  ggtitle("Correlation matrix between numerical variable of dataset") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Function to plot correlation matrix among groups
plot_grouped_ggpairs <- function(data, cat_var, numeric_vars) {
  
  groups <- unique(data[[cat_var]])
  
  plots_list <- lapply(groups, function(g) {
    df_subset <- data[data[[cat_var]] == g, ]
    
    p <- ggpairs(
      df_subset,
      columns = which(names(data) %in% numeric_vars),
      lower = list(continuous = wrap("smooth_loess", alpha = 0.3)),
      upper = list(continuous = wrap("cor", size = 5)),
      diag = list(continuous = wrap("densityDiag"))
    ) + ggtitle(paste("Correlation matrix for", cat_var, "=", g))+
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
    
    return(p)
  })
  
  names(plots_list) <- groups
  return(plots_list)
}

numeric_var_names <- names(data)[numeric_vars]

#Plot of all plot for each categorical variable
plots_all_categorical <- lapply(categorical_vars, function(cat_var) {
  plot_grouped_ggpairs(data, cat_var, numeric_var_names)
})

#Plot of correlation matrix for variable 'Fumatrici'
print(plots_all_categorical[[1]])

#Plot of correlation matrix for variable 'Tipo.parto'
print(plots_all_categorical[[2]])

#Plot of correlation matrix for variable 'Ospedale'
print(plots_all_categorical[[3]])

#Plot of correlation matrix for variable 'Sesso'
print(plots_all_categorical[[4]])

The descriptive and correlation analysis highlighted that variables related to gestation and neonatal anthropometric measurements are strongly associated with newborn weight and represent key potential predictors. Categorical variables such as maternal smoking, delivery type, newborn sex, and hospital facility influence these relationships and their inclusion in the predictive model should be evaluated.

In particular, the attenuated effects observed in the subgroup of smoking mothers suggest considering interaction terms or differential effects in the model to better capture the dynamics present in the sample. Furthermore, the strong correlation among some anthropometric measurements requires appropriate handling of multicollinearity.

Therefore, the multiple regression model to be developed should include both relevant continuous and categorical variables, assess possible interactions, and implement suitable strategies to address multicollinearity and outliers, in order to ensure reliable and interpretable estimates for the prediction of newborn weight.

Regression model creation

In this phase of the project, a multiple linear regression model is developed to quantify the effect of the available variables on newborn birth weight. To this end, a procedure has been implemented that systematically presents the main results of the model, including:

Coefficient estimates that indicate the impact of each variable on newborn weight.
Multicollinearity analysis using the Variance Inflation Factor (VIF), which identifies if some variables are highly correlated with each other, potentially compromising the reliability of the estimates.
Key diagnostic tests to verify the assumptions of the regression model, such as: heteroscedasticity, meaning the constancy of error variance, autocorrelation, referring to the independence of errors over time or the order of observations and the normality of residuals, which is important to ensure the validity of statistical inferences.

#Creation of regression model with backward approach

#Function for model results
mod_results <- function(mod) {
  
  summ <- summary(mod)
  vif_values <- car::vif(mod)
  bp_test <- lmtest::bptest(mod)
  dw_test <- lmtest::dwtest(mod)
  shapiro_test <- shapiro.test(residuals(mod))
  
  #Coefficients table with p-value
  coefs <- as.data.frame(summ$coefficients)
  coefs <- round(coefs, 4)
  rownames(coefs) <- rownames(summ$coefficients)
  colnames(coefs) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  
  print(kable(coefs, caption = "Coefficients of the Regression Model")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  #VIF table
  if (is.matrix(vif_values)) {
    vif_tab <- data.frame(
      Variable = rownames(vif_values),
      GVIF = round(vif_values[,"GVIF"], 3),
      GVIF_corr = round(vif_values[,"GVIF^(1/(2*Df))"], 3)
    )
  } else {
    vif_tab <- data.frame(
      Variable = names(vif_values),
      VIF = round(vif_values, 3)
    )
  }
  
  print(kable(vif_tab, caption = "Variance Inflation Factor (VIF)")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  #Diagnostic test table
  diag_tab <- data.frame(
    Test = c("Breusch-Pagan (Heteroscedasticity)",
             "Durbin-Watson (Autocorrelation)",
             "Shapiro-Wilk (Normality of Residuals)"),
    Statistic = c(round(bp_test$statistic, 3),
                  round(dw_test$statistic, 3),
                  round(shapiro_test$statistic, 3)),
    p_value = c(format.pval(bp_test$p.value, digits = 3, eps = 1e-3),
                format.pval(dw_test$p.value, digits = 3, eps = 1e-3),
                format.pval(shapiro_test$p.value, digits = 3, eps = 1e-3)),
    stringsAsFactors = FALSE
  )
  
  print(kable(diag_tab, caption = "Diagnostic Tests Summary")%>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  #Residuals plots
  par(mfrow = c(2,2))
  plot(mod)
  
  invisible(list(summary = summ,
                 vif = vif_values,
                 bptest = bp_test,
                 dwtest = dw_test,
                 shapiro = shapiro_test))
}

#Transform also other categorical variables in factor
data$Sesso <- factor(data$Sesso, levels = c("M", "F"))
data$Tipo.parto <- factor(data$Tipo.parto, levels = c("Nat", "Ces"))
data$Ospedale <- factor(data$Ospedale, levels = c("osp1", "osp2", "osp3"))

#Model1: all variables
mod1 <- lm(Peso ~., data = data)
mod_results(mod1)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-6628.5902	143.0078	-46.3512	0.0000
Anni.madre	0.8018	1.1467	0.6992	0.4845
N.gravidanze	11.3812	4.6686	2.4378	0.0148
FumatriciFumatrici	-30.2741	27.5492	-1.0989	0.2719
Gestazione	32.5773	3.8208	8.5264	0.0000
Lunghezza	10.2922	0.3009	34.2070	0.0000
Cranio	10.4722	0.4263	24.5671	0.0000
Tipo.partoCes	-29.6335	12.0905	-2.4510	0.0143
Ospedaleosp2	-11.0912	13.4471	-0.8248	0.4096
Ospedaleosp3	28.2495	13.5054	2.0917	0.0366
SessoF	-77.5723	11.1865	-6.9344	0.0000

Variance Inflation Factor (VIF)
	Variable	GVIF	GVIF_corr
Anni.madre	Anni.madre	1.190	1.091
N.gravidanze	N.gravidanze	1.189	1.091
Fumatrici	Fumatrici	1.007	1.004
Gestazione	Gestazione	1.696	1.302
Lunghezza	Lunghezza	2.087	1.445
Cranio	Cranio	1.631	1.277
Tipo.parto	Tipo.parto	1.004	1.002
Ospedale	Ospedale	1.004	1.001
Sesso	Sesso	1.041	1.020

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	93.294	<0.001
DW	Durbin-Watson (Autocorrelation)	1.952	0.114
W	Shapiro-Wilk (Normality of Residuals)	0.974	<0.001

In the initial full model, which includes all available variables, certain features such as gestational age, length, and head circumference are strongly associated with birth weight, highlighting their key role in neonatal growth.

The number of previous pregnancies shows a positive but moderate effect, suggesting that maternal reproductive history may influence the baby’s weight.

Male sex is associated with a higher birth weight, consistent with established biological knowledge.

Conversely, variables such as the hospital of birth and the type of delivery, although included in the initial model, show an overall weak and non-significant effect. This suggests that local differences between hospital groups or delivery methods may have a limited impact on the overall model.

Best model selection

In this phase, starting from the model containing all variables in the dataset, a backward selection approach is adopted to identify the most parsimonious and adequate multiple regression model. The process unfolds through the following steps:

Progressive elimination of less significant variables: through sequential model updates, variables with low estimated relevance are removed and test with ANOVA.
Transformation of the response variable: the weight variable is log-transformed to attenuate heteroscedasticity issues highlighted in diagnostic analyses.
Inclusion of interactions between variables: interactions between smokers and variables such as length and gestation duration are evaluated, based on dataset exploration and literature evidence.
Centering of continuous variables: to reduce multicollinearity among linear and interaction terms, length and gestation variables were centered prior to inclusion in the model.
Incorporation of nonlinear terms: quadratic and nonlinear interaction terms are introduced to capture possible more complex dynamics among variables.
Model evaluation: the models are compared using Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), selecting the model that balances better fit and parsimony.

#Function for ANOVA, AIC and BIC test for models
compare_models <- function(mod_prev, mod_new) {
  #Extract terms of two models
  terms_prev <- attr(terms(mod_prev), "term.labels")
  terms_new <- attr(terms(mod_new), "term.labels")
  
  #Verify if the new model is nested
  is_nested <- all(terms_new %in% terms_prev)
  
  #Calculate AIC and BIC
  aic_tab <- AIC(mod_prev, mod_new)
  bic_tab <- BIC(mod_prev, mod_new)
  
  print(knitr::kable(aic_tab, caption = "Method AIC for best model chose") %>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  print(knitr::kable(bic_tab, caption = "Method BIC for best model chose") %>%
          kableExtra::kable_styling(full_width = FALSE, position = "left"))
  
  if (is_nested) {
    cat("Models are nested. Performing ANOVA test.\n")
    anova_res <- anova(mod_prev, mod_new)
    print(knitr::kable(anova_res, caption = "Test ANOVA results") %>%
            kableExtra::kable_styling(full_width = FALSE, position = "left"))
  } else {
    cat("Models are NOT nested. ANOVA test skipped.\n")
  }
}

#Model2: all variables minus Anni.madre (not significant variable)
mod2 <- update(mod1, ~.-Anni.madre)
mod_results(mod2)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-6601.3083	137.5680	-47.9858	0.0000
N.gravidanze	12.5833	4.3400	2.8994	0.0038
FumatriciFumatrici	-30.4268	27.5455	-1.1046	0.2694
Gestazione	32.2996	3.7997	8.5006	0.0000
Lunghezza	10.2916	0.3008	34.2087	0.0000
Cranio	10.4874	0.4257	24.6375	0.0000
Tipo.partoCes	-29.6654	12.0892	-2.4539	0.0142
Ospedaleosp2	-10.9509	13.4442	-0.8145	0.4154
Ospedaleosp3	28.5171	13.4986	2.1126	0.0347
SessoF	-77.6452	11.1849	-6.9420	0.0000

Variance Inflation Factor (VIF)
	Variable	GVIF	GVIF_corr
N.gravidanze	N.gravidanze	1.028	1.014
Fumatrici	Fumatrici	1.007	1.004
Gestazione	Gestazione	1.677	1.295
Lunghezza	Lunghezza	2.087	1.445
Cranio	Cranio	1.627	1.275
Tipo.parto	Tipo.parto	1.004	1.002
Ospedale	Ospedale	1.003	1.001
Sesso	Sesso	1.041	1.020

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	91.153	<0.001
DW	Durbin-Watson (Autocorrelation)	1.953	0.12
W	Shapiro-Wilk (Normality of Residuals)	0.974	<0.001

compare_models(mod1, mod2)

Method AIC for best model chose
	df	AIC
mod_prev	12	35145.57
mod_new	11	35144.06

Method BIC for best model chose
	df	BIC
mod_prev	12	35215.45
mod_new	11	35208.12

Models are nested. Performing ANOVA test.

Test ANOVA results
Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
2487	186743194	NA	NA	NA	NA
2488	186779904	-1	-36710	0.4888948	0.4844861

#Model3: all variables minus Ospedale (not significant variable)
mod3 <- update(mod2, ~.-Ospedale)
mod_results(mod3)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-6600.2816	137.6162	-47.9615	0.0000
N.gravidanze	12.9927	4.3439	2.9910	0.0028
FumatriciFumatrici	-31.8823	27.5803	-1.1560	0.2478
Gestazione	32.5970	3.8039	8.5693	0.0000
Lunghezza	10.2684	0.3011	34.0985	0.0000
Cranio	10.5015	0.4262	24.6372	0.0000
Tipo.partoCes	-30.4244	12.1041	-2.5136	0.0120
SessoF	-78.1031	11.1998	-6.9736	0.0000

Variance Inflation Factor (VIF)
	Variable	VIF
N.gravidanze	N.gravidanze	1.027
Fumatrici	Fumatrici	1.007
Gestazione	Gestazione	1.676
Lunghezza	Lunghezza	2.085
Cranio	Cranio	1.626
Tipo.parto	Tipo.parto	1.004
Sesso	Sesso	1.040

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	90.431	<0.001
DW	Durbin-Watson (Autocorrelation)	1.954	0.124
W	Shapiro-Wilk (Normality of Residuals)	0.974	<0.001

compare_models(mod2, mod3)

Method AIC for best model chose
	df	AIC
mod_prev	11	35144.06
mod_new	9	35149.33

Method BIC for best model chose
	df	BIC
mod_prev	11	35208.12
mod_new	9	35201.73

Models are nested. Performing ANOVA test.

Test ANOVA results
Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
2488	186779904	NA	NA	NA	NA
2490	187473818	-2	-693913.8	4.621636	0.0099213

#Model4: log(Peso) for manage eteroschedasticity shown in residuals plot of mod3
mod4 <- lm(log(Peso) ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
               Cranio + Tipo.parto + Sesso, data = data)
mod_results(mod4)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	4.4198	0.0434	101.8054	0.0000
N.gravidanze	0.0041	0.0014	3.0097	0.0026
FumatriciFumatrici	-0.0065	0.0087	-0.7443	0.4568
Gestazione	0.0186	0.0012	15.5179	0.0000
Lunghezza	0.0035	0.0001	37.2147	0.0000
Cranio	0.0035	0.0001	26.1255	0.0000
Tipo.partoCes	-0.0070	0.0038	-1.8417	0.0656
SessoF	-0.0181	0.0035	-5.1175	0.0000

Variance Inflation Factor (VIF)
	Variable	VIF
N.gravidanze	N.gravidanze	1.027
Fumatrici	Fumatrici	1.007
Gestazione	Gestazione	1.676
Lunghezza	Lunghezza	2.085
Cranio	Cranio	1.626
Tipo.parto	Tipo.parto	1.004
Sesso	Sesso	1.040

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	169.739	<0.001
DW	Durbin-Watson (Autocorrelation)	1.945	0.0848
W	Shapiro-Wilk (Normality of Residuals)	0.974	<0.001

compare_models(mod3, mod4)

Method AIC for best model chose
	df	AIC
mod_prev	9	35149.325
mod_new	9	-5125.624

Method BIC for best model chose
	df	BIC
mod_prev	9	35201.734
mod_new	9	-5073.215

Models are nested. Performing ANOVA test.

Test ANOVA results
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
N.gravidanze	1	3.571804e+03	3.571804e+03	0.0474402	0.8275968
Fumatrici	1	2.516883e+05	2.516883e+05	3.3428881	0.0676156
Gestazione	1	2.450098e+08	2.450098e+08	3254.1845859	0.0000000
Lunghezza	1	2.048849e+08	2.048849e+08	2721.2517384	0.0000000
Cranio	1	4.707085e+07	4.707085e+07	625.1882273	0.0000000
Tipo.parto	1	4.810017e+05	4.810017e+05	6.3885943	0.0115470
Sesso	1	3.661486e+06	3.661486e+06	48.6313237	0.0000000
Residuals	2490	1.874738e+08	7.529069e+04	NA	NA

#Model5: interaction beetwen Fumatrici and Lunghezza + Fumatrici and Gestazione
mod5 <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Lunghezza + Fumatrici * Gestazione + Cranio + Tipo.parto + Sesso, data = data)
mod_results(mod5)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	4.4134	0.0437	101.0835	0.0000
N.gravidanze	0.0043	0.0014	3.1212	0.0018
FumatriciFumatrici	0.3322	0.2556	1.2996	0.1939
Lunghezza	0.0035	0.0001	36.4149	0.0000
Gestazione	0.0193	0.0012	15.8743	0.0000
Cranio	0.0035	0.0001	26.1248	0.0000
Tipo.partoCes	-0.0073	0.0038	-1.9213	0.0548
SessoF	-0.0183	0.0035	-5.1684	0.0000
FumatriciFumatrici:Lunghezza	0.0011	0.0005	2.2917	0.0220
FumatriciFumatrici:Gestazione	-0.0219	0.0070	-3.1418	0.0017

Variance Inflation Factor (VIF)
	Variable	VIF
N.gravidanze	N.gravidanze	1.028
Fumatrici	Fumatrici	872.074
Lunghezza	Lunghezza	2.138
Gestazione	Gestazione	1.730
Cranio	Cranio	1.627
Tipo.parto	Tipo.parto	1.004
Sesso	Sesso	1.043
Fumatrici:Lunghezza	Fumatrici:Lunghezza	695.009
Fumatrici:Gestazione	Fumatrici:Gestazione	1004.897

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	176.993	<0.001
DW	Durbin-Watson (Autocorrelation)	1.949	0.0997
W	Shapiro-Wilk (Normality of Residuals)	0.975	<0.001

compare_models(mod4, mod5)

Method AIC for best model chose
	df	AIC
mod_prev	9	-5125.624
mod_new	11	-5132.234

Method BIC for best model chose
	df	BIC
mod_prev	9	-5073.215
mod_new	11	-5068.178

Models are NOT nested. ANOVA test skipped.

#Model5 centered: centering of Lunghezza and Gestazione to manage multicollinearity
data$Lunghezza_c <- scale(data$Lunghezza, scale=FALSE)
data$Gestazione_c <- scale(data$Gestazione, scale=FALSE)

mod5_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Gestazione_c + Fumatrici*Lunghezza_c + Cranio + Tipo.parto + Sesso, data=data)
mod_results(mod5_cent)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	6.8967	0.0457	150.8700	0.0000
N.gravidanze	0.0043	0.0014	3.1212	0.0018
FumatriciFumatrici	0.0023	0.0091	0.2512	0.8017
Gestazione_c	0.0193	0.0012	15.8743	0.0000
Lunghezza_c	0.0035	0.0001	36.4149	0.0000
Cranio	0.0035	0.0001	26.1248	0.0000
Tipo.partoCes	-0.0073	0.0038	-1.9213	0.0548
SessoF	-0.0183	0.0035	-5.1684	0.0000
FumatriciFumatrici:Gestazione_c	-0.0219	0.0070	-3.1418	0.0017
FumatriciFumatrici:Lunghezza_c	0.0011	0.0005	2.2917	0.0220

Variance Inflation Factor (VIF)
	Variable	VIF
N.gravidanze	N.gravidanze	1.028
Fumatrici	Fumatrici	1.105
Gestazione_c	Gestazione_c	1.730
Lunghezza_c	Lunghezza_c	2.138
Cranio	Cranio	1.627
Tipo.parto	Tipo.parto	1.004
Sesso	Sesso	1.043
Fumatrici:Gestazione_c	Fumatrici:Gestazione_c	1.416
Fumatrici:Lunghezza_c	Fumatrici:Lunghezza_c	1.387

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	176.993	<0.001
DW	Durbin-Watson (Autocorrelation)	1.949	0.0997
W	Shapiro-Wilk (Normality of Residuals)	0.975	<0.001

compare_models(mod5, mod5_cent)

Method AIC for best model chose
	df	AIC
mod_prev	11	-5132.234
mod_new	11	-5132.234

Method BIC for best model chose
	df	BIC
mod_prev	11	-5068.178
mod_new	11	-5068.178

Models are NOT nested. ANOVA test skipped.

#Model6: interaction beetwen Gestazione and Lunghezza
mod6_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Gestazione_c + Fumatrici*Lunghezza_c + Cranio + Sesso + Gestazione_c*Lunghezza_c, data=data)
mod_results(mod6_cent)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	6.9523	0.0448	155.3155	0.0000
N.gravidanze	0.0038	0.0013	2.8658	0.0042
FumatriciFumatrici	-0.0002	0.0089	-0.0178	0.9858
Gestazione_c	0.0132	0.0013	10.1521	0.0000
Lunghezza_c	0.0033	0.0001	34.2855	0.0000
Cranio	0.0034	0.0001	25.5879	0.0000
SessoF	-0.0222	0.0035	-6.4110	0.0000
FumatriciFumatrici:Gestazione_c	-0.0186	0.0068	-2.7279	0.0064
FumatriciFumatrici:Lunghezza_c	0.0012	0.0005	2.6452	0.0082
Gestazione_c:Lunghezza_c	-0.0002	0.0000	-11.7191	0.0000

Variance Inflation Factor (VIF)
	Variable	VIF
N.gravidanze	N.gravidanze	1.028
Fumatrici	Fumatrici	1.106
Gestazione_c	Gestazione_c	2.070
Lunghezza_c	Lunghezza_c	2.220
Cranio	Cranio	1.641
Sesso	Sesso	1.053
Fumatrici:Gestazione_c	Fumatrici:Gestazione_c	1.417
Fumatrici:Lunghezza_c	Fumatrici:Lunghezza_c	1.388
Gestazione_c:Lunghezza_c	Gestazione_c:Lunghezza_c	1.734

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	147.794	<0.001
DW	Durbin-Watson (Autocorrelation)	1.946	0.0879
W	Shapiro-Wilk (Normality of Residuals)	0.975	<0.001

compare_models(mod5_cent, mod6_cent)

Method AIC for best model chose
	df	AIC
mod_prev	11	-5132.234
mod_new	11	-5262.748

Method BIC for best model chose
	df	BIC
mod_prev	11	-5068.178
mod_new	11	-5198.693

Models are NOT nested. ANOVA test skipped.

#Model7: add of non linear term of Gestazione and eliminate interaction beetwen Gestazione and Lunghezza (for reduce multicollinearity)
data$Gestazione_c2 <- data$Gestazione_c^2
data$Lunghezza_c2 <- data$Lunghezza_c^2

mod7_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici*Gestazione_c + Gestazione_c2 +
              Fumatrici*Lunghezza_c + Lunghezza_c + 
              Cranio + Sesso, data=data)
mod_results(mod7_cent)

Coefficients of the Regression Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	6.9544	0.0447	155.4749	0.0000
N.gravidanze	0.0041	0.0013	3.0477	0.0023
FumatriciFumatrici	-0.0011	0.0089	-0.1238	0.9015
Gestazione_c	0.0117	0.0013	8.7372	0.0000
Gestazione_c2	-0.0024	0.0002	-11.9303	0.0000
Lunghezza_c	0.0033	0.0001	35.1399	0.0000
Cranio	0.0034	0.0001	25.6199	0.0000
SessoF	-0.0216	0.0034	-6.2728	0.0000
FumatriciFumatrici:Gestazione_c	-0.0152	0.0068	-2.2305	0.0258
FumatriciFumatrici:Lunghezza_c	0.0011	0.0005	2.4925	0.0127

Variance Inflation Factor (VIF)
	Variable	VIF
N.gravidanze	N.gravidanze	1.028
Fumatrici	Fumatrici	1.106
Gestazione_c	Gestazione_c	2.228
Gestazione_c2	Gestazione_c2	1.837
Lunghezza_c	Lunghezza_c	2.183
Cranio	Cranio	1.641
Sesso	Sesso	1.051
Fumatrici:Gestazione_c	Fumatrici:Gestazione_c	1.424
Fumatrici:Lunghezza_c	Fumatrici:Lunghezza_c	1.387

Diagnostic Tests Summary
	Test	Statistic	p_value
BP	Breusch-Pagan (Heteroscedasticity)	158.233	<0.001
DW	Durbin-Watson (Autocorrelation)	1.951	0.111
W	Shapiro-Wilk (Normality of Residuals)	0.978	<0.001

compare_models(mod6_cent, mod7_cent)

Method AIC for best model chose
	df	AIC
mod_prev	11	-5262.748
mod_new	11	-5267.497

Method BIC for best model chose
	df	BIC
mod_prev	11	-5198.693
mod_new	11	-5203.442

Models are NOT nested. ANOVA test skipped.

#Resume of best model selection with AIC and BIC
knitr::kable(AIC(mod1, mod2, mod3, mod4, mod5, mod5_cent, mod6_cent, mod7_cent), caption = "Resume of results for method AIC for best model")

Resume of results for method AIC for best model
	df	AIC
mod1	12	35145.571
mod2	11	35144.062
mod3	9	35149.325
mod4	9	-5125.624
mod5	11	-5132.234
mod5_cent	11	-5132.234
mod6_cent	11	-5262.748
mod7_cent	11	-5267.497

knitr::kable(BIC(mod1, mod2, mod3, mod4, mod5, mod5_cent, mod6_cent, mod7_cent), caption = "Resume of results for method BIC for best model")

Resume of results for method BIC for best model
	df	BIC
mod1	12	35215.450
mod2	11	35208.118
mod3	9	35201.734
mod4	9	-5073.215
mod5	11	-5068.178
mod5_cent	11	-5068.178
mod6_cent	11	-5198.693
mod7_cent	11	-5203.442

The final model demonstrates a good balance between simplicity and predictive performance, as highlighted by the AIC, BIC criteria, and coefficient tests. Diagnostic analysis verifies the fundamental assumptions of linear regression and indicates:

Significant heteroscedasticity (Breusch-Pagan test p < 0.001), which justified the logarithmic transformation of the response variable to stabilize variance; this characteristic, besides suggesting the possible adoption of robust methods, may reflect intrinsic biological properties.
Absence of significant autocorrelation in residuals (Durbin-Watson test p ≈ 0.11), indicating independence of errors.
Minor deviations from residual normality (Shapiro-Wilk test p < 0.001, statistically close to 1), allowing reasonable acceptance of the assumption for inferential purposes.

The most relevant variables for predicting neonatal weight include gestational duration, which exhibits a non-linear effect with a more marked increase in the early stages followed by stabilization, head circumference, male sex, and the number of previous pregnancies. Additionally, maternal smoking interacts complexly with gestational age and newborn length, highlighting the importance of considering such interactions in the model.

This summary provides a robust and interpretable model that faithfully describes the data dynamics and lays solid foundations for further considerations.

Model quality analysis

To ensure the model’s robustness and reliability, an analysis is conducted to identify outliers, high-leverage points, and influential observations that could disproportionately impact the estimates. This step is particularly important in light of the heteroscedasticity revealed by the preliminary diagnostics.

The investigation relies on established metrics such as leverage, studentized residuals, and Cook’s distance, complemented by graphical visualizations and tables of the most relevant observations. This approach allows distinguishing between anomalies or errors and biologically plausible but rare values, thus supporting informed decisions on potential model refinements or further analyses.

#Function to diagnosis of outlier and leverage points
diagnostic_plots_ggplot <- function(mod) {
  
  n <- length(residuals(mod))
  p <- length(coefficients(mod))
  
  data_diag <- data.frame(
    Index = 1:n,
    Leverage = hatvalues(mod),
    RStudent = rstudent(mod),
    CooksDistance = cooks.distance(mod)
  )
  
  #Cutoff for influential points
  leverage_cutoff <- 2 * p / n
  rstudent_cutoff <- 2
  cook_cutoff <- 4 / (n - p)
  
  #Flag influential points
  data_diag <- data_diag %>%
    mutate(
      HighLeverage = Leverage > leverage_cutoff,
      Outlier = abs(RStudent) > rstudent_cutoff,
      Influential = CooksDistance > cook_cutoff
    )
  
  #Function to general plot
  plot_diag <- function(df, yvar, ylab, cutoff, flagvar) {
    ggplot(df, aes(x = Index, y = !!sym(yvar))) +
      geom_point(aes(color = !!sym(flagvar))) +
      geom_line() +
      geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
      scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
      labs(title = paste(ylab, "with Influential Points Highlighted"),
           y = ylab, x = "Index") +
      theme_classic()
  }
  
  #Plots
  p1 <- plot_diag(data_diag, "Leverage", "Leverage (hat values)", leverage_cutoff, "HighLeverage")
  p2 <- plot_diag(data_diag, "RStudent", "Residuals student", rstudent_cutoff, "Outlier")
  p3 <- plot_diag(data_diag, "CooksDistance", "Cook's distance", cook_cutoff, "Influential")
  
  #Influential points table
  influential_points <- data_diag %>%
    filter(HighLeverage | Outlier | Influential) %>%
    arrange(desc(CooksDistance)) %>%
    dplyr::select(Index, Leverage, RStudent, CooksDistance, HighLeverage, Outlier, Influential)
  
  #table_kable <- knitr::kable(influential_points, caption = "Table of influential points")
  
  #Return plots
  return(list(LeveragePlot = p1, ResidualsPlot = p2, CookPlot = p3, InfluentialPoints =   influential_points))
}

res <- diagnostic_plots_ggplot(mod7_cent)

#Plots of outlier, leverage and Cook's distance
print(res$LeveragePlot)

print(res$ResidualsPlot)

print(res$CookPlot)

#Extract of influential data
influential_indices <- res$InfluentialPoints$Index
influential_data <- data[influential_indices, 1:10]
influential_data$Index <- influential_indices

#Influential points of dataset
diagnostics_info <- res$InfluentialPoints %>%
  dplyr::select(Index, HighLeverage, Outlier, Influential) 

#Built of join
final_influential <- inner_join(influential_data, diagnostics_info, by = "Index")

kable(final_influential, caption = "Data and diagnosis of influential points") %>%
  kable_styling(full_width = FALSE, position = "left")

Data and diagnosis of influential points
Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso	Index	HighLeverage	Outlier	Influential
35	1	NonFumatrici	38	4370	315	374	Nat	osp3	F	1549	TRUE	TRUE	TRUE
25	2	NonFumatrici	25	900	325	253	Nat	osp3	F	1778	TRUE	TRUE	TRUE
24	4	NonFumatrici	29	1280	390	355	Nat	osp1	F	1427	TRUE	TRUE	TRUE
26	0	Fumatrici	39	4930	550	350	Ces	osp2	F	1918	TRUE	FALSE	TRUE
40	3	NonFumatrici	28	1560	420	379	Nat	osp3	F	310	TRUE	TRUE	TRUE
28	0	NonFumatrici	26	930	345	245	Ces	osp3	F	2450	TRUE	FALSE	TRUE
30	0	NonFumatrici	36	3610	410	330	Nat	osp1	M	155	TRUE	TRUE	TRUE
30	1	NonFumatrici	36	1280	385	292	Nat	osp2	F	1426	TRUE	TRUE	TRUE
32	0	NonFumatrici	27	1140	370	267	Nat	osp3	F	2118	TRUE	FALSE	TRUE
35	1	NonFumatrici	32	1890	500	309	Nat	osp2	F	2113	TRUE	TRUE	TRUE
28	1	NonFumatrici	27	980	320	265	Nat	osp1	M	2435	TRUE	FALSE	TRUE
32	4	Fumatrici	39	2250	460	319	Nat	osp2	F	1424	TRUE	FALSE	TRUE
31	0	NonFumatrici	31	990	340	278	Ces	osp2	F	1617	TRUE	TRUE	TRUE
34	1	Fumatrici	40	3180	470	326	Nat	osp1	F	1396	TRUE	FALSE	TRUE
41	3	NonFumatrici	35	1500	420	304	Nat	osp1	M	1591	FALSE	TRUE	TRUE
27	1	NonFumatrici	38	3240	410	359	Ces	osp1	F	2038	TRUE	TRUE	TRUE
31	3	Fumatrici	37	3440	460	362	Ces	osp1	M	2335	TRUE	FALSE	TRUE
27	0	NonFumatrici	35	3140	465	290	Nat	osp2	F	2223	FALSE	TRUE	TRUE
28	3	Fumatrici	41	2940	465	328	Nat	osp2	F	2351	TRUE	FALSE	TRUE
30	4	NonFumatrici	35	4520	520	360	Nat	osp2	F	1551	TRUE	TRUE	TRUE
26	0	Fumatrici	40	3400	525	350	Nat	osp2	M	391	TRUE	FALSE	TRUE
22	1	Fumatrici	40	3350	520	355	Nat	osp3	M	99	TRUE	FALSE	TRUE
25	1	Fumatrici	38	3280	475	324	Nat	osp1	F	567	TRUE	FALSE	TRUE
29	1	Fumatrici	41	4420	530	362	Ces	osp2	F	140	TRUE	FALSE	TRUE
33	1	Fumatrici	38	3030	500	353	Nat	osp3	F	1624	TRUE	FALSE	TRUE
18	0	NonFumatrici	40	1850	460	305	Nat	osp3	F	295	FALSE	TRUE	TRUE
44	0	Fumatrici	38	3150	465	335	Nat	osp3	F	335	TRUE	FALSE	TRUE
33	0	NonFumatrici	29	1620	410	292	Nat	osp3	F	1383	TRUE	FALSE	TRUE
20	0	Fumatrici	42	3870	530	381	Ces	osp1	F	2014	TRUE	FALSE	TRUE
35	6	Fumatrici	38	2430	460	324	Nat	osp2	F	442	TRUE	FALSE	TRUE
30	7	NonFumatrici	35	2220	470	316	Nat	osp3	M	582	TRUE	TRUE	TRUE
42	2	NonFumatrici	38	2560	525	349	Ces	osp2	M	1397	FALSE	TRUE	TRUE
30	3	NonFumatrici	38	4600	485	380	Nat	osp1	M	1292	FALSE	TRUE	TRUE
24	0	NonFumatrici	35	1450	405	280	Nat	osp1	F	750	FALSE	TRUE	TRUE
31	0	NonFumatrici	34	1370	390	287	Nat	osp2	F	101	FALSE	TRUE	TRUE
33	10	NonFumatrici	40	3090	485	353	Nat	osp3	M	2420	TRUE	FALSE	TRUE
36	8	NonFumatrici	41	3730	480	335	Nat	osp3	M	1448	TRUE	FALSE	TRUE
28	0	NonFumatrici	39	3800	520	300	Nat	osp3	F	1710	FALSE	TRUE	TRUE
28	0	Fumatrici	41	3820	500	350	Nat	osp2	M	424	TRUE	FALSE	TRUE
35	2	Fumatrici	39	4030	510	344	Ces	osp2	M	2410	TRUE	FALSE	TRUE
23	1	NonFumatrici	36	3850	460	334	Ces	osp3	F	1692	FALSE	TRUE	TRUE
32	2	NonFumatrici	39	3430	445	322	Ces	osp1	F	1633	FALSE	TRUE	TRUE
31	2	Fumatrici	38	3350	510	348	Nat	osp1	M	2280	TRUE	FALSE	TRUE
35	10	NonFumatrici	39	2950	495	335	Nat	osp1	F	2219	TRUE	FALSE	TRUE
30	1	NonFumatrici	39	3000	475	390	Ces	osp2	F	684	TRUE	TRUE	TRUE
29	0	NonFumatrici	40	3470	525	390	Nat	osp1	M	1866	FALSE	TRUE	TRUE
25	1	NonFumatrici	37	1750	430	305	Nat	osp3	F	322	FALSE	TRUE	TRUE
30	2	NonFumatrici	39	3790	505	304	Ces	osp3	M	1393	FALSE	TRUE	TRUE
22	0	NonFumatrici	32	2340	445	304	Nat	osp1	F	1356	TRUE	TRUE	TRUE
27	4	Fumatrici	41	3350	485	335	Ces	osp3	M	1261	TRUE	FALSE	TRUE
32	4	Fumatrici	39	3400	510	356	Nat	osp1	M	1284	TRUE	FALSE	TRUE
21	1	Fumatrici	40	2590	480	328	Ces	osp1	F	413	TRUE	FALSE	TRUE
33	0	NonFumatrici	36	3110	465	300	Ces	osp3	M	1854	FALSE	TRUE	TRUE
31	0	NonFumatrici	40	3410	550	372	Nat	osp2	M	119	FALSE	TRUE	TRUE
33	0	NonFumatrici	30	1750	410	294	Nat	osp2	M	2198	TRUE	FALSE	TRUE
24	0	NonFumatrici	42	2800	520	340	Ces	osp2	M	2313	FALSE	TRUE	TRUE
34	4	NonFumatrici	42	2660	500	320	Nat	osp2	F	1716	TRUE	TRUE	TRUE
19	1	NonFumatrici	38	1950	445	305	Nat	osp2	M	1741	FALSE	TRUE	TRUE
35	1	Fumatrici	41	2910	490	340	Nat	osp3	M	2108	TRUE	FALSE	TRUE
29	3	Fumatrici	41	3090	495	350	Nat	osp3	M	473	TRUE	FALSE	TRUE
22	1	NonFumatrici	34	3030	470	312	Nat	osp2	F	1891	FALSE	TRUE	TRUE
28	1	NonFumatrici	40	3790	460	332	Nat	osp2	F	1267	FALSE	TRUE	TRUE
40	8	NonFumatrici	38	3520	470	341	Nat	osp3	M	516	TRUE	FALSE	TRUE
28	0	Fumatrici	40	2740	500	325	Nat	osp3	F	2001	TRUE	FALSE	TRUE
37	2	Fumatrici	41	3220	500	357	Ces	osp3	M	1689	TRUE	FALSE	TRUE
27	2	Fumatrici	39	2920	490	351	Ces	osp3	M	1523	TRUE	FALSE	TRUE
27	0	Fumatrici	42	3970	515	360	Nat	osp2	M	1658	TRUE	FALSE	TRUE
23	0	NonFumatrici	41	4900	510	352	Nat	osp2	F	1305	FALSE	TRUE	TRUE
31	4	Fumatrici	38	3530	495	335	Nat	osp1	F	1969	TRUE	FALSE	TRUE
30	8	NonFumatrici	39	2860	490	337	Ces	osp2	F	1503	TRUE	FALSE	TRUE
26	3	NonFumatrici	31	1960	420	300	Nat	osp2	F	1067	TRUE	FALSE	TRUE
27	0	Fumatrici	41	3160	490	323	Nat	osp1	F	1026	TRUE	FALSE	TRUE
23	1	NonFumatrici	40	3520	445	363	Nat	osp1	F	220	FALSE	FALSE	TRUE
26	1	NonFumatrici	30	1170	370	273	Nat	osp3	M	2305	TRUE	FALSE	TRUE
26	8	NonFumatrici	40	3250	500	355	Nat	osp2	M	2084	TRUE	FALSE	TRUE
29	1	NonFumatrici	41	4010	470	353	Nat	osp2	M	1229	FALSE	TRUE	TRUE
38	4	NonFumatrici	38	4370	530	340	Nat	osp3	F	1960	FALSE	TRUE	TRUE
34	10	NonFumatrici	38	2880	470	345	Ces	osp2	M	2469	TRUE	FALSE	TRUE
38	0	NonFumatrici	40	3700	470	320	Nat	osp1	M	390	FALSE	TRUE	TRUE
38	3	NonFumatrici	41	2320	450	330	Nat	osp2	F	656	FALSE	TRUE	TRUE
29	1	Fumatrici	40	4180	520	355	Ces	osp1	M	2047	TRUE	FALSE	TRUE
34	2	NonFumatrici	33	1410	380	295	Nat	osp2	F	492	TRUE	FALSE	TRUE
41	1	NonFumatrici	39	2160	450	325	Ces	osp3	M	1193	FALSE	TRUE	TRUE
37	0	NonFumatrici	41	2420	490	300	Ces	osp1	M	1554	FALSE	TRUE	TRUE
22	0	NonFumatrici	32	1430	380	301	Nat	osp1	M	1699	TRUE	FALSE	TRUE
40	6	NonFumatrici	34	1840	430	305	Nat	osp1	F	1310	TRUE	FALSE	TRUE
22	0	Fumatrici	39	3290	510	346	Nat	osp3	M	1761	TRUE	FALSE	TRUE
32	1	NonFumatrici	33	2040	480	307	Ces	osp1	F	1272	FALSE	FALSE	TRUE
26	1	Fumatrici	38	3060	490	350	Nat	osp3	M	2283	TRUE	FALSE	TRUE
30	2	Fumatrici	40	3130	480	320	Nat	osp1	M	1759	TRUE	FALSE	TRUE
38	6	NonFumatrici	37	3950	500	350	Nat	osp3	M	134	FALSE	FALSE	TRUE
27	1	NonFumatrici	35	2100	440	345	Ces	osp2	F	615	FALSE	TRUE	TRUE
29	0	NonFumatrici	42	3540	540	368	Nat	osp1	M	616	FALSE	TRUE	TRUE
25	6	NonFumatrici	38	2530	460	340	Nat	osp1	F	2315	TRUE	FALSE	TRUE
33	2	Fumatrici	40	3320	490	320	Nat	osp3	M	1650	TRUE	FALSE	TRUE
32	3	NonFumatrici	36	3780	480	349	Nat	osp3	M	1340	FALSE	TRUE	TRUE
30	0	NonFumatrici	38	4540	530	343	Ces	osp3	M	1539	FALSE	TRUE	TRUE
22	0	NonFumatrici	42	3100	510	361	Nat	osp2	F	1586	FALSE	TRUE	TRUE
31	0	NonFumatrici	37	2040	465	305	Nat	osp2	F	1508	FALSE	TRUE	TRUE
35	0	NonFumatrici	33	1390	390	277	Nat	osp1	F	748	TRUE	FALSE	TRUE
23	1	Fumatrici	41	3590	500	349	Nat	osp3	F	105	TRUE	FALSE	TRUE
38	3	Fumatrici	40	3300	510	345	Nat	osp3	M	234	TRUE	FALSE	TRUE
40	0	NonFumatrici	34	1580	400	300	Nat	osp2	F	2255	FALSE	FALSE	TRUE
27	1	NonFumatrici	38	2480	500	315	Ces	osp2	M	1913	FALSE	TRUE	TRUE
37	3	Fumatrici	39	3150	470	341	Nat	osp1	F	1395	TRUE	FALSE	TRUE
32	5	NonFumatrici	36	2400	470	333	Nat	osp1	F	1008	FALSE	FALSE	TRUE
35	1	NonFumatrici	39	2660	460	364	Ces	osp1	F	1400	FALSE	FALSE	TRUE
37	1	NonFumatrici	39	2500	490	352	Ces	osp1	M	2217	FALSE	TRUE	TRUE
29	4	NonFumatrici	30	1340	400	273	Ces	osp1	M	106	TRUE	FALSE	TRUE
26	1	NonFumatrici	30	1170	370	266	Nat	osp2	M	1247	TRUE	FALSE	TRUE
30	1	NonFumatrici	41	4440	510	335	Nat	osp3	M	791	FALSE	TRUE	TRUE
32	6	NonFumatrici	40	4200	510	350	Nat	osp1	M	828	FALSE	FALSE	TRUE
35	9	NonFumatrici	42	3760	540	348	Nat	osp2	F	161	TRUE	FALSE	TRUE
33	3	NonFumatrici	40	3640	470	337	Nat	osp3	F	2285	FALSE	TRUE	TRUE
39	2	NonFumatrici	35	1980	450	308	Ces	osp1	M	724	FALSE	TRUE	TRUE
20	0	Fumatrici	38	2990	470	330	Nat	osp2	F	557	TRUE	FALSE	TRUE
30	0	NonFumatrici	40	2380	480	305	Nat	osp2	M	1217	FALSE	FALSE	TRUE
27	1	NonFumatrici	38	2150	450	315	Ces	osp3	M	458	FALSE	TRUE	TRUE
26	0	NonFumatrici	34	2050	460	325	Nat	osp1	F	1211	FALSE	FALSE	TRUE
37	4	NonFumatrici	37	3720	480	355	Ces	osp1	F	1470	FALSE	FALSE	TRUE
39	1	NonFumatrici	31	1500	405	295	Nat	osp3	M	206	TRUE	FALSE	FALSE
25	0	NonFumatrici	39	3290	450	338	Nat	osp3	F	2056	FALSE	TRUE	FALSE
30	2	Fumatrici	37	3280	500	344	Ces	osp1	M	128	TRUE	FALSE	FALSE
40	2	Fumatrici	38	3440	495	337	Nat	osp3	F	1900	TRUE	FALSE	FALSE
27	2	NonFumatrici	39	2260	460	330	Nat	osp2	F	1486	FALSE	TRUE	FALSE
37	2	NonFumatrici	36	3710	480	350	Nat	osp3	M	2130	FALSE	TRUE	FALSE
18	0	NonFumatrici	36	3050	460	315	Nat	osp2	F	1556	FALSE	TRUE	FALSE
23	2	Fumatrici	40	2950	495	340	Ces	osp2	F	1037	TRUE	FALSE	FALSE
22	1	Fumatrici	40	3670	500	340	Nat	osp1	M	120	TRUE	FALSE	FALSE
22	0	NonFumatrici	32	2580	470	330	Nat	osp1	F	2214	TRUE	FALSE	FALSE
33	2	NonFumatrici	39	3660	490	318	Nat	osp2	F	890	FALSE	TRUE	FALSE
38	1	NonFumatrici	40	3980	480	335	Nat	osp1	F	2193	FALSE	TRUE	FALSE
19	0	NonFumatrici	36	2100	445	332	Ces	osp1	F	1754	FALSE	TRUE	FALSE
22	0	Fumatrici	40	4220	510	375	Nat	osp1	M	1581	TRUE	FALSE	FALSE
23	3	Fumatrici	41	2620	475	313	Nat	osp3	M	2218	TRUE	FALSE	FALSE
30	0	Fumatrici	38	3380	490	340	Nat	osp3	F	1169	TRUE	FALSE	FALSE
26	1	NonFumatrici	32	1280	360	276	Nat	osp2	M	312	TRUE	FALSE	FALSE
31	1	NonFumatrici	39	3400	460	325	Nat	osp1	M	377	FALSE	TRUE	FALSE
30	2	NonFumatrici	39	4240	485	352	Nat	osp2	M	130	FALSE	TRUE	FALSE
18	1	Fumatrici	40	3600	500	335	Nat	osp3	M	296	TRUE	FALSE	FALSE
35	2	NonFumatrici	39	3300	460	320	Ces	osp1	M	908	FALSE	TRUE	FALSE
33	3	NonFumatrici	39	3950	480	345	Nat	osp3	M	2133	FALSE	TRUE	FALSE
24	1	NonFumatrici	40	3820	500	320	Ces	osp2	F	146	FALSE	TRUE	FALSE
36	1	NonFumatrici	41	2940	500	358	Nat	osp3	M	1935	FALSE	TRUE	FALSE
30	1	NonFumatrici	39	2380	495	325	Nat	osp1	F	1497	FALSE	TRUE	FALSE
27	2	Fumatrici	37	2850	460	335	Nat	osp1	F	251	TRUE	FALSE	FALSE
33	0	NonFumatrici	37	2230	455	326	Nat	osp1	M	1748	FALSE	TRUE	FALSE
26	0	Fumatrici	40	3570	500	335	Nat	osp2	M	1456	TRUE	FALSE	FALSE
21	1	Fumatrici	41	3270	510	340	Nat	osp2	M	1904	TRUE	FALSE	FALSE
21	0	NonFumatrici	37	2750	510	333	Nat	osp1	M	632	FALSE	TRUE	FALSE
30	0	Fumatrici	40	3120	500	344	Nat	osp3	M	984	TRUE	FALSE	FALSE
27	0	NonFumatrici	28	1285	400	274	Nat	osp1	F	378	TRUE	FALSE	FALSE
22	2	NonFumatrici	38	2410	470	336	Nat	osp2	M	364	FALSE	TRUE	FALSE
24	2	NonFumatrici	40	2560	490	323	Nat	osp1	M	1191	FALSE	TRUE	FALSE
28	1	Fumatrici	40	2850	490	334	Nat	osp1	F	2143	TRUE	FALSE	FALSE
35	3	NonFumatrici	40	2660	485	335	Ces	osp3	M	1286	FALSE	TRUE	FALSE
35	1	NonFumatrici	38	2850	500	360	Nat	osp1	F	648	FALSE	TRUE	FALSE
25	2	NonFumatrici	41	3990	495	335	Nat	osp3	M	472	FALSE	TRUE	FALSE
32	2	Fumatrici	38	2920	480	348	Nat	osp2	F	1051	TRUE	FALSE	FALSE
28	1	NonFumatrici	40	2410	470	319	Nat	osp1	M	417	FALSE	TRUE	FALSE
34	0	NonFumatrici	39	3180	470	310	Nat	osp3	F	195	FALSE	TRUE	FALSE
26	2	Fumatrici	39	3530	505	332	Ces	osp2	F	194	TRUE	FALSE	FALSE
26	0	NonFumatrici	40	4330	500	355	Nat	osp3	F	1036	FALSE	TRUE	FALSE
24	0	NonFumatrici	38	2400	460	335	Nat	osp3	M	1877	FALSE	TRUE	FALSE
21	0	NonFumatrici	41	2700	500	336	Nat	osp2	F	653	FALSE	TRUE	FALSE
37	8	NonFumatrici	28	930	355	235	Nat	osp1	F	2173	TRUE	FALSE	FALSE
30	0	NonFumatrici	40	3860	480	338	Nat	osp1	M	633	FALSE	TRUE	FALSE
24	1	Fumatrici	39	3400	485	344	Nat	osp1	F	242	TRUE	FALSE	FALSE
25	0	NonFumatrici	41	2980	500	356	Nat	osp3	M	950	FALSE	TRUE	FALSE
37	3	NonFumatrici	33	2000	470	293	Ces	osp1	F	1608	TRUE	FALSE	FALSE
21	0	NonFumatrici	39	3580	485	320	Nat	osp3	M	318	FALSE	TRUE	FALSE
27	1	NonFumatrici	39	4650	510	354	Nat	osp2	M	2021	FALSE	TRUE	FALSE
34	3	NonFumatrici	40	4580	515	360	Nat	osp2	M	1836	FALSE	TRUE	FALSE
21	0	NonFumatrici	39	3950	500	330	Nat	osp1	F	1132	FALSE	TRUE	FALSE
26	0	NonFumatrici	40	3580	480	326	Nat	osp1	F	418	FALSE	TRUE	FALSE
17	0	NonFumatrici	37	2050	390	295	Nat	osp2	F	1014	TRUE	FALSE	FALSE
35	1	Fumatrici	39	2440	450	326	Nat	osp1	F	1332	TRUE	FALSE	FALSE
20	0	NonFumatrici	38	3700	480	335	Nat	osp3	F	5	FALSE	TRUE	FALSE
20	1	NonFumatrici	40	2510	480	322	Ces	osp2	M	1419	FALSE	TRUE	FALSE
30	1	NonFumatrici	32	2260	440	322	Ces	osp3	F	592	TRUE	FALSE	FALSE
30	6	NonFumatrici	35	2680	450	322	Nat	osp1	F	757	TRUE	FALSE	FALSE
28	1	NonFumatrici	39	2600	480	350	Nat	osp1	F	262	FALSE	TRUE	FALSE
25	1	NonFumatrici	41	2950	515	333	Ces	osp2	M	460	FALSE	TRUE	FALSE
24	0	NonFumatrici	38	3920	505	333	Nat	osp3	M	2010	FALSE	TRUE	FALSE
25	1	Fumatrici	42	2980	490	335	Nat	osp2	M	758	TRUE	FALSE	FALSE
26	0	NonFumatrici	40	2660	500	343	Nat	osp1	F	2396	FALSE	TRUE	FALSE
34	1	Fumatrici	40	3310	490	330	Nat	osp3	M	1118	TRUE	FALSE	FALSE
28	0	Fumatrici	39	3310	500	355	Ces	osp1	M	1718	TRUE	FALSE	FALSE
35	3	Fumatrici	38	2750	475	332	Nat	osp1	F	1670	TRUE	FALSE	FALSE
34	0	NonFumatrici	41	3940	500	340	Nat	osp1	F	623	FALSE	TRUE	FALSE
24	0	NonFumatrici	38	2920	505	351	Ces	osp3	M	455	FALSE	TRUE	FALSE
34	1	NonFumatrici	40	2580	495	325	Nat	osp3	F	1210	FALSE	TRUE	FALSE
25	6	NonFumatrici	33	2230	430	313	Nat	osp3	F	2357	TRUE	FALSE	FALSE
30	8	NonFumatrici	40	3850	518	340	Nat	osp3	F	204	TRUE	FALSE	FALSE
39	1	Fumatrici	39	2770	470	342	Nat	osp3	F	1574	TRUE	FALSE	FALSE
27	0	NonFumatrici	32	1550	410	289	Nat	osp1	F	445	TRUE	FALSE	FALSE
28	0	NonFumatrici	38	3720	500	330	Nat	osp3	F	1517	FALSE	TRUE	FALSE
29	0	Fumatrici	37	2440	460	315	Nat	osp1	F	658	TRUE	FALSE	FALSE
27	2	Fumatrici	39	2880	490	325	Ces	osp1	M	2359	TRUE	FALSE	FALSE
26	0	NonFumatrici	40	3000	520	340	Ces	osp3	M	1137	FALSE	TRUE	FALSE
36	0	NonFumatrici	31	1180	355	270	Nat	osp3	F	2112	TRUE	FALSE	FALSE
32	1	Fumatrici	33	1780	400	305	Ces	osp1	F	2087	TRUE	FALSE	FALSE
30	1	Fumatrici	37	4050	530	350	Nat	osp1	M	2243	TRUE	FALSE	FALSE
35	5	Fumatrici	40	4300	530	360	Nat	osp3	M	2044	TRUE	FALSE	FALSE
25	0	NonFumatrici	40	2990	515	350	Ces	osp1	F	2183	FALSE	TRUE	FALSE
19	0	Fumatrici	40	3380	500	361	Nat	osp2	M	1441	TRUE	FALSE	FALSE
35	0	NonFumatrici	32	1780	420	277	Ces	osp1	F	1807	TRUE	FALSE	FALSE
27	1	NonFumatrici	40	3890	495	330	Nat	osp2	M	2368	FALSE	TRUE	FALSE
28	1	NonFumatrici	39	3920	485	345	Nat	osp2	F	850	FALSE	TRUE	FALSE
25	0	NonFumatrici	28	830	310	254	Nat	osp1	F	928	TRUE	FALSE	FALSE
27	0	Fumatrici	39	2990	485	350	Nat	osp1	F	1931	TRUE	FALSE	FALSE
31	1	NonFumatrici	38	3920	490	345	Nat	osp2	M	1942	FALSE	TRUE	FALSE
25	0	NonFumatrici	41	2210	430	310	Nat	osp3	F	956	TRUE	FALSE	FALSE
34	0	NonFumatrici	39	2690	495	335	Nat	osp1	M	2149	FALSE	TRUE	FALSE
31	1	NonFumatrici	38	2670	500	336	Ces	osp2	F	980	FALSE	TRUE	FALSE
25	0	Fumatrici	40	3860	520	345	Nat	osp1	M	703	TRUE	FALSE	FALSE
39	3	NonFumatrici	30	1300	380	276	Nat	osp1	M	2147	TRUE	FALSE	FALSE
21	0	NonFumatrici	40	4050	500	340	Nat	osp3	M	1583	FALSE	TRUE	FALSE
31	1	Fumatrici	40	3310	490	334	Ces	osp2	M	2457	TRUE	FALSE	FALSE
29	1	NonFumatrici	39	3800	490	333	Ces	osp3	M	90	FALSE	TRUE	FALSE
30	2	NonFumatrici	29	1190	360	272	Nat	osp2	F	805	TRUE	FALSE	FALSE
30	1	Fumatrici	39	3100	485	325	Nat	osp3	F	2200	TRUE	FALSE	FALSE
26	1	NonFumatrici	39	2850	500	347	Ces	osp1	M	2422	FALSE	TRUE	FALSE
25	1	NonFumatrici	39	2720	495	345	Nat	osp3	F	361	FALSE	TRUE	FALSE
23	1	NonFumatrici	39	4100	505	350	Nat	osp1	F	1206	FALSE	TRUE	FALSE
34	1	Fumatrici	39	2910	490	325	Nat	osp2	M	932	TRUE	FALSE	FALSE
31	1	Fumatrici	40	2850	485	330	Nat	osp1	F	1110	TRUE	FALSE	FALSE
38	12	NonFumatrici	39	3350	490	344	Nat	osp2	M	1218	TRUE	FALSE	FALSE
24	3	Fumatrici	40	3020	490	338	Nat	osp2	F	1325	TRUE	FALSE	FALSE
36	8	NonFumatrici	36	2860	460	334	Nat	osp2	F	1725	TRUE	FALSE	FALSE
37	2	NonFumatrici	31	1690	405	290	Nat	osp2	M	2406	TRUE	FALSE	FALSE
19	0	Fumatrici	39	2710	470	331	Nat	osp3	F	321	TRUE	FALSE	FALSE
16	1	NonFumatrici	31	1900	440	300	Nat	osp2	F	587	TRUE	FALSE	FALSE
33	11	NonFumatrici	43	3400	475	360	Nat	osp1	M	1130	TRUE	FALSE	FALSE
21	1	Fumatrici	40	3550	510	335	Ces	osp3	M	1413	TRUE	FALSE	FALSE
32	6	NonFumatrici	39	3290	480	360	Ces	osp2	M	1409	TRUE	FALSE	FALSE
27	0	NonFumatrici	31	1800	430	308	Ces	osp3	M	1684	TRUE	FALSE	FALSE
23	0	Fumatrici	39	3100	500	325	Nat	osp2	M	1367	TRUE	FALSE	FALSE
27	1	Fumatrici	37	3110	485	348	Nat	osp2	F	182	TRUE	FALSE	FALSE
32	0	Fumatrici	40	3510	500	346	Nat	osp2	M	1281	TRUE	FALSE	FALSE
30	1	NonFumatrici	33	1770	410	275	Nat	osp3	M	1091	TRUE	FALSE	FALSE
36	6	NonFumatrici	40	4340	515	383	Nat	osp1	M	1320	TRUE	FALSE	FALSE
20	0	NonFumatrici	41	2280	450	280	Ces	osp3	M	151	TRUE	FALSE	FALSE
25	0	Fumatrici	41	3640	510	350	Nat	osp3	M	572	TRUE	FALSE	FALSE
27	0	Fumatrici	40	3870	530	340	Nat	osp1	M	668	TRUE	FALSE	FALSE
24	1	Fumatrici	39	2830	485	315	Ces	osp1	M	1017	TRUE	FALSE	FALSE
26	1	Fumatrici	37	3140	490	335	Nat	osp2	F	2236	TRUE	FALSE	FALSE
36	4	Fumatrici	40	3480	495	347	Nat	osp1	M	1226	TRUE	FALSE	FALSE
33	6	NonFumatrici	37	2900	465	350	Ces	osp2	F	2146	TRUE	FALSE	FALSE
36	2	Fumatrici	40	3040	470	345	Nat	osp1	M	2463	TRUE	FALSE	FALSE
28	2	Fumatrici	41	3860	520	353	Nat	osp2	M	1174	TRUE	FALSE	FALSE
32	1	Fumatrici	40	3170	500	330	Ces	osp2	M	593	TRUE	FALSE	FALSE
31	0	NonFumatrici	31	1730	430	300	Nat	osp3	F	2456	TRUE	FALSE	FALSE
33	0	Fumatrici	40	2940	470	352	Nat	osp2	F	1408	TRUE	FALSE	FALSE
31	1	Fumatrici	39	4020	530	356	Nat	osp3	F	1535	TRUE	FALSE	FALSE
24	5	NonFumatrici	42	3600	510	335	Ces	osp3	M	985	TRUE	FALSE	FALSE
28	1	Fumatrici	40	3270	480	363	Nat	osp1	M	538	TRUE	FALSE	FALSE
29	1	Fumatrici	39	3110	500	327	Nat	osp1	F	2235	TRUE	FALSE	FALSE
35	9	NonFumatrici	37	3150	490	335	Nat	osp2	M	1779	TRUE	FALSE	FALSE
33	1	Fumatrici	40	3860	525	346	Ces	osp1	M	2448	TRUE	FALSE	FALSE
25	1	Fumatrici	37	2740	475	321	Nat	osp2	F	1787	TRUE	FALSE	FALSE
21	0	Fumatrici	40	3050	495	322	Ces	osp3	F	1510	TRUE	FALSE	FALSE
38	2	Fumatrici	38	2970	465	349	Nat	osp3	M	1478	TRUE	FALSE	FALSE
19	0	Fumatrici	40	3320	500	335	Nat	osp1	M	279	TRUE	FALSE	FALSE
24	1	Fumatrici	38	2850	465	340	Ces	osp2	F	1471	TRUE	FALSE	FALSE
23	0	Fumatrici	38	3310	495	340	Ces	osp3	M	306	TRUE	FALSE	FALSE
34	3	NonFumatrici	32	1615	390	297	Nat	osp3	F	947	TRUE	FALSE	FALSE
20	0	Fumatrici	39	2920	490	320	Nat	osp2	F	699	TRUE	FALSE	FALSE
36	8	NonFumatrici	39	3610	500	351	Ces	osp1	M	89	TRUE	FALSE	FALSE
24	0	Fumatrici	40	3130	490	342	Ces	osp2	F	2254	TRUE	FALSE	FALSE
33	1	Fumatrici	37	2570	460	320	Nat	osp2	F	1378	TRUE	FALSE	FALSE
32	1	Fumatrici	37	3040	490	330	Nat	osp2	F	1447	TRUE	FALSE	FALSE
25	3	Fumatrici	40	3250	495	334	Nat	osp1	M	2268	TRUE	FALSE	FALSE
26	1	Fumatrici	39	3080	485	340	Nat	osp2	F	1270	TRUE	FALSE	FALSE
20	0	Fumatrici	36	2650	470	312	Ces	osp3	M	1927	TRUE	FALSE	FALSE
21	1	Fumatrici	39	3250	500	336	Nat	osp2	F	1423	TRUE	FALSE	FALSE
34	2	Fumatrici	40	3380	485	361	Nat	osp2	M	2099	TRUE	FALSE	FALSE

#Plot for diagnosis of detected values
plot_diagnosis <- function(data, xvar, yvar = "Peso", groupvar = NULL) {
  p <- ggplot(data, aes_string(x = xvar, y = yvar)) +
    geom_point(alpha = 0.5) +
    geom_point(data=filter(data, HighLeverage), color="red", size=3) +
    geom_point(data=filter(data, Outlier), shape=1, color="blue", size=3) +
    geom_point(data=filter(data, Influential), shape=4, color="black", size=3) +
    geom_smooth(method = "loess")+
    theme_classic()
    
  if (!is.null(groupvar)) {
    p <- p + facet_wrap(as.formula(paste0("~", groupvar)))
  }
  
  p <- p + 
    theme_minimal() +
    labs(
      title = paste("Influential points highlited on", yvar, "vs", xvar),
      subtitle = "Red: HighLeverage, Blue: Outlier, Black: Influential",
      x = xvar,
      y = yvar
    )
  
  return(p)
}

vars_to_plot <- c("Gestazione", "Lunghezza", "N.gravidanze", "Cranio")

plots_list <- lapply(vars_to_plot, function(var) {
  # Add grouping  only if variables is Gestazione or Lunghezza
  if (var %in% c("Gestazione", "Lunghezza")) {
    group_var <- "Fumatrici"
  } else {
    group_var <- "Sesso"
  }
  plot_diagnosis(final_influential, xvar = var, yvar = "Peso", groupvar = group_var)
})

#Plots for each regressor
print(plots_list[[1]])

print(plots_list[[2]])

print(plots_list[[3]])

print(plots_list[[4]])

#Plot of residuals vs fitted
df_resid <- data.frame(
  Fitted = fitted(mod7_cent),
  Residuals = resid(mod7_cent),
  StdResid = rstandard(mod7_cent),
  Index = 1:length(resid(mod7_cent))
)

#Residuals vs fitted
ggplot(df_resid, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha=0.5) +
  geom_hline(yintercept = 0, linetype="dashed") +
  geom_smooth(method = "loess", se=FALSE, color="red") +
  labs(title = "Residuals vs predicted values", x = "Predicted values", y = "Residuals") +
  theme_classic()

#Calculation of R^2 and RMSE of the best model
R_squared <- summary(mod7_cent)$r.squared
R_squared <- round(R_squared, 4)

#Calculation of RMSE of the best model
predicted <- predict(mod7_cent)
actual <- log(data$Peso)
RMSE <- sqrt(mean((actual - predicted)^2))
RMSE <- round(RMSE, 4)

#Table of R^2 and RMSE of the best model
df_res <- data.frame(R_squared, RMSE)
knitr::kable(df_res, caption = "Table of R^2 and RMSE of the best model")

Table of R^2 and RMSE of the best model
R_squared	RMSE
0.7903	0.0839

Key observations from the diagnostic evaluation of the final model:

Outliers and influential points: the analysis of leverage, studentized residuals, and Cook’s distance identifies few cases with potential significant influence. These observations have extreme characteristics (e.g., multiple pregnancies or body measurements outside the common range) but are biologically plausible and not necessarily data errors.
Variable interactions: some influential observations are linked to specific combinations of key variables such as gestational age, length, and maternal smoking, reflecting important interactions in the model. Considering the overall good quality and interpretability, no modifications or exclusions are performed.
Residual analysis: plots of residuals versus fitted values show a generally homogeneous distribution without systematic patterns, confirming the model’s adequacy. However, some heteroscedasticity remains, which is addressed by the logarithmic transformation of the response variable.
Model adequacy and predictive accuracy: with an R² of about 0.79, the model explains a large portion of the variability in the log-transformed birth weight. The RMSE of about 0.084 indicates good predictive accuracy and limited discrepancy between observed and predicted values, confirming the model’s validity and reliability.
Overall impact: although influential points are identified, no evidence arises that compromises the overall validity or predictive power. The model balances parsimony and clarity in presenting relevant effects and adequately captures data variability. For this reason and to preserve these characteristics, it has been decided not to further explore robust model management methods.

Prediction and results

This section demonstrates how to use the model to make predictions of the newborn’s weight based on the model parameters, highlighting how the model performs inference to find the best estimate for the target variable.

#Prediction with some example values
predict_weight <- function(N_gravidanze, Fumatrici, Gestazione, Lunghezza, Cranio, Sesso, model, mean_gest, mean_lung) {
  
  #Control of factorial levels
  if(!Fumatrici %in% levels(model$model$Fumatrici)) {
    stop("Invalid value for Fumatrici")
  }
  if(!Sesso %in% unique(model$model$Sesso)) {
    stop("Invalid value for Sesso")
  }
  
  #Centering of variable
  Gestazione_c <- as.double(Gestazione - mean_gest)
  Gestazione_c2 <- as.double(Gestazione_c^2)
  Lunghezza_c <- as.double(Lunghezza - mean_lung)
  
  #Creation of matrix 1x1 as the same class of variables in model
  Gestazione_c <- matrix(Gestazione_c, nrow = 1, ncol = 1)
  Gestazione_c2 <- matrix(Gestazione_c2, nrow = 1, ncol = 1)
  Lunghezza_c <- matrix(Lunghezza_c, nrow = 1, ncol = 1)
  
  #New data
  newdata <- data.frame(
    N.gravidanze = as.numeric(N_gravidanze),
    Fumatrici = factor(Fumatrici, levels = levels(model$model$Fumatrici)),
    Cranio = as.numeric(Cranio),
    Sesso = as.character(Sesso)
  )
  
  #Add matrix for variables Gestazione_c, Gestazione_c2 and Lunghezza_c
  newdata$Gestazione_c <- I(Gestazione_c)
  newdata$Gestazione_c2 <- I(Gestazione_c2)
  newdata$Lunghezza_c <- I(Lunghezza_c)
  
  #Logaritmic prediction
  log_pred <- predict(model, newdata = newdata)
  
  #Return of transformed prediction
  weight_pred <- exp(log_pred)
  
  return(weight_pred)
}

#Mean calculation out of the function for centering variables
mean_gest <- mean(data$Gestazione, na.rm = TRUE)
mean_lung <- mean(data$Lunghezza, na.rm = TRUE)

predicted_weight <- predict_weight(
  N_gravidanze = 3,
  Fumatrici = "NonFumatrici",
  Gestazione = 39,
  Lunghezza = 490, 
  Cranio = 337,
  Sesso = "F",
  model = mod7_cent,
  mean_gest = mean_gest,
  mean_lung = mean_lung
)

Predicted birth weight: 3173 grams

The provided function can be used to generate different prediction examples by modifying the variable values in each call. In this case, considering a mother in her third pregnancy, non-smoker, with 39 weeks of gestation and a female newborn measuring 490 mm in length and 337 mm in head circumference, the model estimates a birth weight of approximately 3200 g.

No external validation or formal cross-validation of the predictive model has been conducted. However, these practices represent a fundamental step to evaluate the generalizability and robustness of the model on independent data or in different clinical settings. Their implementation would allow confirmation of the reliability of the estimates in different populations and further enhance confidence in the clinical use of the model.

Future analyses would be even more reliable by including external validation on independent datasets or cross-validation to assess and, if necessary, optimize predictive performance with a view to transferability and broader practical application.

Visualizations

This section presents the main results of the model, interpreted within the relevant clinical context, visualizing:

The behavior of the target variable as a function of the predictors, to highlight the specific impact each predictor has on the final prediction.
The comparison between predicted and real values of weight, to highlight prediction for central and outlier values of the distribution.
The analysis of residuals among groups of Fumatrici and Sesso, to highlight if the model introduce or not bias into groups.

#Function to plot relationship with regressors and target variable
create_plots <- function(predict_func, model, mean_gest, mean_lung, fixed_vals_list) {
  
  #Create levels for categorical variables
  fumatrici_levels <- levels(model$model$Fumatrici)
  sesso_levels <- levels(model$model$Sesso)
  
  #Plot 1: Peso vs Gestazione and Fumatrici
  gest_seq <- seq(25, 45, 0.5)
  
  df1 <- expand.grid(Gestazione = gest_seq, Fumatrici = fumatrici_levels)
  df1$Predicted <- mapply(function(g,f) {
    predict_func(N_gravidanze = fixed_vals_list$p1$N_gravidanze,
                 Fumatrici = f,
                 Gestazione = g,
                 Lunghezza = fixed_vals_list$p1$Lunghezza,
                 Cranio = fixed_vals_list$p1$Cranio,
                 Sesso = fixed_vals_list$p1$Sesso,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df1$Gestazione, df1$Fumatrici)
  
  p1 <- ggplot(df1, aes(x=Gestazione, y=Predicted, color=Fumatrici)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs Gestazione and Fumatrici") +
    theme_classic()
  
  #Plot 2: Peso vs Lunghezza and Fumatrici
  lung_seq <- seq(300, 560, 0.5)
  df2 <- expand.grid(Lunghezza = lung_seq, Fumatrici = fumatrici_levels)
  df2$Predicted <- mapply(function(l,f) {
    pred <- predict_func(N_gravidanze = fixed_vals_list$p2$N_gravidanze,
                 Fumatrici = f,
                 Gestazione = fixed_vals_list$p2$Gestazione,
                 Lunghezza = l,
                 Cranio = fixed_vals_list$p2$Cranio,
                 Sesso = fixed_vals_list$p2$Sesso,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df2$Lunghezza, df2$Fumatrici)
  
  p2 <- ggplot(df2, aes(x=Lunghezza, y=Predicted, color=Fumatrici)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred.Weight vs Lunghezza and Fumatrici") +
    theme_classic()
 
  #Plot 3: Peso vs N.Gravidanze and Sesso
  ng_seq <- seq(0, 10, 1)
  df3 <- expand.grid(N_gravidanze = ng_seq, Sesso = sesso_levels)
  df3$Predicted <- mapply(function(ng,s) {
    predict_func(N_gravidanze = ng,
                 Fumatrici = fixed_vals_list$p3$Fumatrici,
                 Gestazione = fixed_vals_list$p3$Gestazione,
                 Lunghezza = fixed_vals_list$p3$Lunghezza,
                 Cranio = fixed_vals_list$p3$Cranio,
                 Sesso = s,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df3$N_gravidanze, df3$Sesso)
  
  p3 <- ggplot(df3, aes(x=N_gravidanze, y=Predicted, color=Sesso)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs N.gravidanze and Sesso") +
    theme_classic()
  
  #Plot 4: Peso vs Cranio and Sesso
  c_seq <- seq(230, 400, 0.5)
  df4 <- expand.grid(Cranio = c_seq, Sesso = sesso_levels)
  df4$Predicted <- mapply(function(c,s) {
    predict_func(N_gravidanze = fixed_vals_list$p4$N_gravidanze,
                 Fumatrici = fixed_vals_list$p4$Fumatrici,
                 Gestazione = fixed_vals_list$p4$Gestazione,
                 Lunghezza = fixed_vals_list$p4$Lunghezza,
                 Cranio = c,
                 Sesso = s,
                 model = model,
                 mean_gest = mean_gest,
                 mean_lung = mean_lung)
  }, df4$Cranio, df4$Sesso)
  
  p4 <- ggplot(df4, aes(x=Cranio, y=Predicted, color=Sesso)) +
    geom_point() + 
    geom_smooth(method = "loess")+
    labs(title="Pred. Weight vs Cranio and Sesso") +
    theme_classic()
  
  return(list(p1, p2, p3, p4))
}

#Creation of fixed vals list
fixed_vals_list <- list(
  p1 = list(N_gravidanze=3, Lunghezza=490, Cranio=337, Sesso="F"),
  p2 = list(N_gravidanze=3, Gestazione=39, Cranio=337, Sesso="F"),
  p3 = list(Fumatrici="NonFumatrici", Gestazione=39, Lunghezza=490, Cranio=337),
  p4 = list(N_gravidanze=3, Fumatrici="NonFumatrici", Gestazione=39, Lunghezza=490)
)

#Execution and visualization
plots <- create_plots(predict_weight, mod7_cent, mean_gest, mean_lung, fixed_vals_list)

#Creation of wrap plots
wrap_plots(plots, ncol=2, nrow=2)

The results obtained from the predictive model clearly highlight the most influential maternal and biometric variables affecting newborn weight, providing useful tools for early evaluation of the risk of impaired growth.

Impact of gestation and smoking:

Gestational age is the primary factor determining the progressive increase in fetal weight. However, the distinction between smoking and non-smoking mothers reveals a clear weight gap, with infants of smoking mothers being significantly smaller. This finding confirms the need for targeted preventive actions aimed at smoking cessation during pregnancy to reduce the incidence of intrauterine growth restriction. Moreover, for high-risk cases (e.g., smoking mothers with advanced gestation but low predicted weight), it is essential to schedule closer monitoring and consider early availability of neonatal intensive care.

Role of biometric measurements:

Variables such as length and head circumference represent reliable indicators of overall fetal growth. The linear or near-linear correlations with weight suggest that, when integrated into routine ultrasounds, they can be used to develop personalized weight estimates at term. Such information allows identification of infants potentially at risk of underdevelopment or macrosomia, enabling optimization of clinical decisions regarding delivery and neonatal care.

Influence of maternal factors and sex:

The number of previous pregnancies shows a positive impact on weight, likely due to progressive maternal physiological adaptations. The differentiation by fetal sex, with males generally heavier, underscores the need to consider these characteristics when calculating personalized growth curves. These findings support a differentiated risk assessment for male and female newborns, aiding specific care protocols.

Clinical and operational implications:

Primary prevention: Informing and supporting pregnant women about the risks of smoking is crucial. Smoking cessation strategies should be integrated into prenatal care.
Personalized monitoring: The combined use of gestational age, fetal biometrics, and reproductive history allows adjustment of ultrasound follow-up and timely interventions in cases of possible intrauterine growth restriction or macrosomia.
Neonatal intensive care planning: For infants estimated to have low weight for gestational age or other risk factors, it is advisable to allocate dedicated beds in the neonatal intensive care unit (NICU), improving resource management and prognosis.

#Plot of predicted vs real values of target variable

#Predicted values on original scale (exp of log prediction)
predicted_values <- exp(predict(mod7_cent))

#Create dataframe with index and both observed and predicted weights
df_long <- data.frame(
  Index = 1:nrow(data),
  Observed = data$Peso,
  Predicted = predicted_values
) %>%
  pivot_longer(cols = c("Observed", "Predicted"), 
               names_to = "Type", values_to = "Weight")

#Plot of Weight vs Index, colored by Type (Observed or Predicted)
ggplot(df_long, aes(x = Index, y = Weight, color = Type)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(
    values = c("Observed" = "blue", "Predicted" = "red"),
    labels = c("Observed", "Predicted")
  ) +
  labs(
    title = "Comparison of Observed and Predicted Newborn Weight",
    x = "Observation Index",
    y = "Weight (grams)",
    color = "Data Type"
  ) +
  theme_classic()

The plot shows a good overall correspondence between the observed values (in blue) and the predicted values (in red) from the model across the entire series of observations. It can be seen that the model explains well the variability of most newborn weights within the central range of the distribution.

For extreme values, both at the lower and upper ends of the weight distribution, a greater discrepancy between observed and predicted values is observed. This phenomenon is consistent with the presence of heteroscedasticity highlighted by diagnostic tests, that is, a variation in the dispersion of prediction errors that increases with the expected value.

Moreover, the presence of biologically plausible outliers (such as very small newborns due to prematurity or very large newborns in cases of macrosomia or multiple pregnancies) makes it more difficult for the model to provide precise estimates in these extreme ranges. These cases represent natural clinical variability and not measurement errors, and their influence contributes to greater uncertainty in the predictions.

The practical implications to consider are as follows:

In the extreme weight ranges, it should be recognized that the model’s predictions may be less accurate and should be supplemented with additional clinical assessments.
The model remains a reliable and useful tool in many clinical situations, but recognizing its limitations in these extreme areas allows for its use with awareness.

#Function to plot residuals of model vs groups of categorical variables of model
std_res_plot <- function(data, x, y) {
  ggplot(data, aes(x = {{x}}, y = {{y}}, fill = {{x}})) +
    geom_boxplot() +
    labs(
      title = paste("Standardized Residuals by", deparse(substitute(x))),
      x = deparse(substitute(x)),
      y = deparse(substitute(y))
    ) +
    theme_classic()
}

#Add residuals of model
data$residuals_std <- rstandard(mod7_cent)

#Plot of results
p1 <- std_res_plot(data, Fumatrici, residuals_std)
p2 <- std_res_plot(data, Sesso, residuals_std)

p1 + p2 + plot_layout(ncol = 2)

The analysis of residuals stratified by groups confirms that the model does not introduce systematic bias related to the considered categorical variables (smoking status and sex). The distribution of prediction errors is similar across groups, with medians close to zero and comparable spread of residual variability. The presence of some outliers, evenly distributed among categories, indicates that difficult-to-predict cases are not associated with specific groups but rather reflect natural biological variability. These results support what has been previously discussed, namely that the model performs well on most data and that larger discrepancies mainly occur in correspondence with rare and anomalous but biologically plausible values.

Conclusions

The analysis conducted with the predictive model allows for a reliable estimation of neonatal weight in most cases, providing an objective support for the early assessment of the risk of altered fetal growth. The model highlights the importance of maternal and biometric variables, such as gestational age, smoking status, infant sex, length, and head circumference, in determining the estimated weight.

The residual analysis stratified by smoking status and sex groups showed no systematic bias, confirming that the model maintains good predictive equity across these categories. However, greater uncertainty emerged in the estimation for extreme weight values, i.e., in rare but biologically plausible cases. This uncertainty requires the qualified intervention of clinical expertise, which must validate and integrate the estimates especially in cases of very low or very high predicted weight, or in the presence of significant maternal risk factors.

At the operational level, particular attention is recommended for:

Weight predictions at the lower or upper limits of the distribution, which require further diagnostic investigations and more frequent monitoring.
Cases with known risk factors, such as maternal smoking, which require targeted checks and preventive interventions.
Integration of model data with clinical and instrumental evaluations to support a more informed overall decision.

In terms of model improvement, the systematic collection of longitudinal fetal growth data and the inclusion of additional relevant clinical variables could increase the predictive accuracy and robustness of the model. The collection of repeated biometric parameters over time, as well as more detailed information on maternal conditions, would facilitate the development of dynamic and personalized models, with a potential positive impact on prenatal care pathways.

Statistical model for newborn weight prediction

2026-02-01

Project introduction

Descriptive analysis of dataset

Statistical index calculation

Shapiro-Wilk test for numerical variables

Numerical variable distribution visual analysis

Numerical variable Q-Q plot analysis

Numerical variable outlier analysis

Multivariate analysis of dataset and statistical tests

Mean calculation beetwen groups and relative differences

Statistical tests to investigate differences beetwen groups (focus on Peso and Lunghezza beetwen Sesso)

Statistical test for differences in cesarean deliveries among hospitals

Statistical test for differences of Peso and Lunghezza from population

Correlation analysis

Regression model creation

Best model selection

Model quality analysis

Prediction and results

Visualizations

Conclusions