The project aims to develop a statistical model to accurately predict newborn birth weight using clinical data from three hospitals. This initiative seeks to improve the management of high-risk pregnancies, optimize hospital resources, and promote better neonatal outcomes, within a context of growing attention to the prevention of neonatal complications. Predicting neonatal birth weight is essential for optimal care planning and reducing risks associated with preterm births or low birth weight infants.
The main benefits of the project are:
#Load dataset
data <- read.csv("C:/Users/matmi/OneDrive/Desktop/ProgettiR/neonati.csv")
#First 10 rows of dataset
kable(head(data, 10), caption = "First 10 rows of dataset")
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso |
|---|---|---|---|---|---|---|---|---|---|
| 26 | 0 | 0 | 42 | 3380 | 490 | 325 | Nat | osp3 | M |
| 21 | 2 | 0 | 39 | 3150 | 490 | 345 | Nat | osp1 | F |
| 34 | 3 | 0 | 38 | 3640 | 500 | 375 | Nat | osp2 | M |
| 28 | 1 | 0 | 41 | 3690 | 515 | 365 | Nat | osp2 | M |
| 20 | 0 | 0 | 38 | 3700 | 480 | 335 | Nat | osp3 | F |
| 32 | 0 | 0 | 40 | 3200 | 495 | 340 | Nat | osp2 | F |
| 26 | 1 | 0 | 39 | 3100 | 480 | 345 | Nat | osp3 | F |
| 25 | 0 | 0 | 40 | 3580 | 510 | 349 | Nat | osp1 | M |
| 22 | 1 | 0 | 40 | 3670 | 500 | 335 | Ces | osp2 | F |
| 23 | 0 | 0 | 41 | 3700 | 510 | 362 | Ces | osp2 | F |
#Summary of dataset
kable(summary(data), caption = "Statistical index summary of dataset")
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.0000 | Min. :0.0000 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Length:2500 | Length:2500 | Length:2500 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | 1st Qu.:0.0000 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Class :character | Class :character | Class :character | |
| Median :28.00 | Median : 1.0000 | Median :0.0000 | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | Mode :character | Mode :character | Mode :character | |
| Mean :28.16 | Mean : 0.9812 | Mean :0.0416 | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | Max. :1.0000 | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
attach(data)
To build the predictive model, data were collected on 2,500 newborns from three hospitals. The collected variables include:
The dataset shows the following key characteristics:
The dataset analysis is structured into the following main phases:
#Function for statistical index calculation of each variable
calc_stats <- function(x) {
c(
mean = mean(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
skewness = moments::skewness(x, na.rm = TRUE),
kurtosis = moments::kurtosis(x, na.rm = TRUE)
)
}
#Matrix of statistical index for each numeric variable of dataset
numeric_vars <- sapply(data, is.numeric)
numeric_vars <- numeric_vars & sapply(data, function(x) length(unique(na.omit(x))) > 2)
stat_summary <- sapply(data[, numeric_vars], calc_stats)
kable(stat_summary, caption = "Matrix of statistical index for each numeric variable of dataset")
| Anni.madre | N.gravidanze | Gestazione | Peso | Lunghezza | Cranio | |
|---|---|---|---|---|---|---|
| mean | 28.1640000 | 0.981200 | 38.980400 | 3284.0808000 | 494.692000 | 340.0292000 |
| median | 28.0000000 | 1.000000 | 39.000000 | 3300.0000000 | 500.000000 | 340.0000000 |
| sd | 5.2735783 | 1.280587 | 1.868639 | 525.0387442 | 26.318644 | 16.4253299 |
| skewness | 0.0428115 | 2.514254 | -2.065313 | -0.6470308 | -1.514699 | -0.7850527 |
| kurtosis | 3.3804165 | 13.989406 | 11.258150 | 5.0315318 | 9.487174 | 5.9462063 |
The calculated statistics show the following situation for each variable:
This phase aims to assess whether the distributions of the numerical variables in the dataset approximate a normal distribution, a fundamental prerequisite for the use of many parametric statistical techniques, such as the t-test or linear regression. To this end, statistical tests (Shapiro-Wilk) and visual methods, such as density plots compared to the theoretical normal distribution and Q-Q plots (quantile-quantile), were employed. These allow identification of any asymmetries, heavy tails, or deviations from normality.
The results of this analysis will guide the selection of the most appropriate statistical tools and models in the subsequent phases of the project.
#Function for Shapiro-Wilk test
shapiro_summary <- function(x, alpha = 0.05) {
x <- na.omit(x)
if (length(x) < 3) {
return(NA)
}
if (length(x) > 5000) {
x <- sample(x, 5000)
}
test <- shapiro.test(x)
p_value <- test$p.value
result <- ifelse(p_value > alpha, "Fail to reject H0 (normality)", "Reject H0 (not normal)")
return(c(p_value = p_value, conclusion = result))
}
shapiro_results <- lapply(data[, numeric_vars], shapiro_summary)
shapiro_df <- do.call(rbind, lapply(shapiro_results, function(x) {
data.frame(p_value = x["p_value"], conclusion = x["conclusion"])
}))
rownames(shapiro_df) <- names(shapiro_results)
kable(shapiro_df, caption = "Results of Shapiro-Wilk test on each variable of dataset", row.names = TRUE)
| p_value | conclusion | |
|---|---|---|
| Anni.madre | 1.63894746545117e-09 | Reject H0 (not normal) |
| N.gravidanze | 8.89524555042765e-54 | Reject H0 (not normal) |
| Gestazione | 2.6437822902771e-45 | Reject H0 (not normal) |
| Peso | 3.2344586185273e-22 | Reject H0 (not normal) |
| Lunghezza | 3.18605170830064e-36 | Reject H0 (not normal) |
| Cranio | 1.26291692019115e-24 | Reject H0 (not normal) |
The Shapiro-Wilk test indicates that no variable is perfectly normal, but in large samples even small deviations lead to rejecting the normality hypothesis. However, in large samples, parametric tests are generally robust to moderate departures from normality, especially when supported by graphical analyses (to be conducted in the following phases) and by considering the practical relevance of the deviations observed. This strategy allows making the most of the effectiveness and sensitivity of parametric methods to draw solid conclusions, while remaining attentive to potential limitations caused by more pronounced deviations.
#Function for density plot
plot_density <- function(x, main_title = deparse(substitute(x))) {
x <- na.omit(x)
m <- mean(x)
s <- sd(x)
df <- data.frame(value = x)
x_seq <- seq(min(x), max(x), length.out = 1000)
norm_density <- dnorm(x_seq, mean = m, sd = s)
norm_df <- data.frame(x = x_seq, y = norm_density)
ggplot(df, aes(x = value)) +
geom_density(color = "black", fill = "gray80") +
geom_line(data = norm_df, aes(x = x, y = y), color = "red", size = 1) +
labs(title = paste("Density plot of", main_title),
x = "Value",
y = "Density") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
}
#Variables for plot
vars_to_plot <- data[, numeric_vars]
plots_list <- lapply(names(vars_to_plot), function(var_name) {
plot_density(vars_to_plot[[var_name]], main_title = var_name)
})
#Plot of each variable distribution vs normal distribution with same mean and standard deviation
combined_plot <- wrap_plots(plots_list, ncol = 2, nrow = 3)
print(combined_plot)
The visual analysis of the density plots confirms the main characteristics highlighted by the statistical indicators:
#Function for Q-Q plot
qq_plot <- function(x, main_title = deparse(substitute(x))) {
x <- na.omit(x)
df <- data.frame(value = x)
ggplot(df, aes(sample = value)) +
stat_qq(color = "blue")+
stat_qq_line(color = "red")+
labs(title = paste("Q-Q plot of", main_title),
x = "Theoretical quantiles",
y = "Sample quantiles") +
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
}
qq_list <- lapply(names(vars_to_plot), function(var_name) {
qq_plot(vars_to_plot[[var_name]], main_title = var_name)
})
combined_plot <- wrap_plots(qq_list, ncol = 2, nrow = 3)
print(combined_plot)
The results of the Q-Q plots, which display the quantile-quantile distribution of the variables compared to theoretical normality, confirm the previous observations regarding the closeness of some variables to normality and the more marked deviations of others, especially in the tails and asymmetries.
In this phase, an outlier analysis was performed using the interquartile range (IQR) rule to identify anomalous values that could affect the quality of the analysis. Clinically relevant data, such as cases of prematurity or multiple pregnancies, were retained, while only values deemed implausible were removed to ensure the reliability and representativeness of the sample.
#Function for boxplot
box_plot <- function(x, main_title = deparse(substitute(x))) {
x <- na.omit(x)
df <- data.frame(value = x)
ggplot(df, aes(x = "", y = value)) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
labs(title = paste("Boxplot of", main_title),
y = main_title,
x = "") +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5))
}
box_list <- lapply(names(vars_to_plot), function(var_name) {
box_plot(vars_to_plot[[var_name]], main_title = var_name)
})
combined_plot <- wrap_plots(box_list, ncol = 3, nrow = 2)
print(combined_plot)
#Function to calculate outliers
calc_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
which_outliers <- which(x < lower_bound | x > upper_bound)
if(length(which_outliers) == 0) {
return(NULL)
}
n_outliers <- length(which_outliers)
total <- sum(!is.na(x))
percent_outliers <- 100 * n_outliers / total
conclusion <- ifelse(percent_outliers > 5, "High number of outliers", "Acceptable number of outliers")
return(list(n_outliers = n_outliers, percent_outliers = percent_outliers, conclusion = conclusion))
}
outliers_list <- lapply(data[, numeric_vars], calc_outliers)
outliers_df <- bind_rows(lapply(outliers_list, as.data.frame), .id = "Variable")
kable(outliers_df, caption = "Table of number, percentage and conclusion for outliers of each numerical variable")
| Variable | n_outliers | percent_outliers | conclusion |
|---|---|---|---|
| Anni.madre | 13 | 0.52 | Acceptable number of outliers |
| N.gravidanze | 246 | 9.84 | High number of outliers |
| Gestazione | 67 | 2.68 | Acceptable number of outliers |
| Peso | 69 | 2.76 | Acceptable number of outliers |
| Lunghezza | 59 | 2.36 | Acceptable number of outliers |
| Cranio | 48 | 1.92 | Acceptable number of outliers |
The outlier analysis shows that the number of pregnancies has a significant proportion of extreme values (9.84%) that could influence the model, while mother’s age has few outliers likely due to errors that need correction. The other variables exhibit few anomalies, mainly attributable to clinically valid data (e.g., prematurity), which should be retained. Overall, the data are of good quality and suitable for modeling, with particular attention to managing outliers in “Number of pregnancies” and “Mother’s age.”
#Function to find and remove error in variable Anni.Madre
remove_rows_condition <- function(data, var, condition_func) {
rows_to_remove <- which(condition_func(data[[var]]))
data_filtered <- data[-rows_to_remove, ]
return(data_filtered)
}
data <- remove_rows_condition(data, "Anni.madre", function(x) x < 10)
#Summary of dataset after removing error in variable Anni.Madre
kable(summary(data), caption = "Summary of dataset after removing error in variable Anni.Madre")
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :13.00 | Min. : 0.0000 | Min. :0.00000 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Length:2498 | Length:2498 | Length:2498 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | 1st Qu.:0.00000 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Class :character | Class :character | Class :character | |
| Median :28.00 | Median : 1.0000 | Median :0.00000 | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | Mode :character | Mode :character | Mode :character | |
| Mean :28.19 | Mean : 0.9816 | Mean :0.04163 | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | Max. :1.00000 | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
An exploratory analysis was conducted by identifying and removing evident errors, such as maternal ages below biologically plausible values, eliminating only a few observations without compromising the overall quality of the dataset. Outliers related to clinical variables such as weight, length, head circumference, and gestation duration were deemed plausible and therefore retained, as were the rare high values in the “N.gravidanze” variable. However, considering the potential influence of these extreme values on the model, it might be useful to further investigate the analysis by applying robust statistical methods or variable transformations in the future, while carefully assessing whether these approaches lead to meaningful improvements without overly complicating the model’s interpretability and parsimony.
Below, a multivariate analysis of the dataset variables is conducted. This analysis aims to examine the distributions of numerical variables in relation to the corresponding levels of categorical variables, in order to identify differences between groups, through:
#Function to calculate mean aggregated by other variables
calc_aggregate <- function(data, num_var, cat_var) {
agg <- data %>%
group_by(across(all_of(cat_var))) %>%
summarise(
mean = mean(.data[[num_var]], na.rm = TRUE),
.groups = "drop"
)%>%
pivot_longer(cols = mean, names_to = "Stat", values_to = "Value") %>%
pivot_wider(names_from = !!sym(cat_var), values_from = Value) %>%
mutate(Variable = num_var, .before = 1)
#Extract only mean
means_row <- agg %>% dplyr::select(-Variable, -Stat)
group_names <- colnames(means_row)
means_values <- as.numeric(means_row[1, ])
#Function to calculate mean differences between pairs of values for each categorical variable
comb_pairs <- combn(seq_along(means_values), 2, simplify = FALSE)
diff_cols <- map(comb_pairs, function(idx) {
i <- idx[1]
j <- idx[2]
diff_val <- means_values[j] - means_values[i]
col_name <- paste0("diff_", group_names[j], "_minus_", group_names[i])
diff_col <- rep(NA_real_, nrow(agg))
media_row_index <- which(agg$Stat == "mean")
diff_col[media_row_index] <- diff_val
list(name = col_name, values = diff_col)
})
for (c in diff_cols) {
agg[[c$name]] <- c$values
}
return(agg)
}
#Definition of categorical variables
categorical_vars <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")
numerical_vars <- names(data)[numeric_vars]
results_list <- list()
#For statement to save all dataset (aggregated index by each categorical variable)
for (cat_var in categorical_vars) {
agg_tables <- lapply(numerical_vars, function(num_var) {
df_agg <- calc_aggregate(data, num_var, cat_var)
})
results_list[[cat_var]] <- bind_rows(agg_tables)
}
#Print of each calculated table
kable(results_list[["Fumatrici"]], caption = "Aggregated stats by Fumatrici")
| Variable | Stat | 0 | 1 | diff_1_minus_0 |
|---|---|---|---|---|
| Anni.madre | mean | 28.1804511 | 28.317308 | 0.1368566 |
| N.gravidanze | mean | 0.9690894 | 1.269231 | 0.3001414 |
| Gestazione | mean | 38.9670008 | 39.269231 | 0.3022299 |
| Peso | mean | 3286.2623225 | 3236.346154 | -49.9161686 |
| Lunghezza | mean | 494.8099415 | 492.067308 | -2.7426338 |
| Cranio | mean | 340.0588972 | 339.346154 | -0.7127434 |
kable(results_list[["Tipo.parto"]], caption = "Aggregated stats by Tipo.parto")
| Variable | Stat | Ces | Nat | diff_Nat_minus_Ces |
|---|---|---|---|---|
| Anni.madre | mean | 28.201923 | 28.1796610 | -0.0222621 |
| N.gravidanze | mean | 1.023352 | 0.9644068 | -0.0589449 |
| Gestazione | mean | 39.023352 | 38.9615819 | -0.0617697 |
| Peso | mean | 3282.046703 | 3285.0632768 | 3.0165735 |
| Lunghezza | mean | 496.365385 | 494.0090395 | -2.3563451 |
| Cranio | mean | 340.012363 | 340.0361582 | 0.0237956 |
kable(results_list[["Ospedale"]], caption = "Aggregated stats by Ospedale")
| Variable | Stat | osp1 | osp2 | osp3 | diff_osp2_minus_osp1 | diff_osp3_minus_osp1 | diff_osp3_minus_osp2 |
|---|---|---|---|---|---|---|---|
| Anni.madre | mean | 28.045343 | 28.1037736 | 28.407674 | 0.0584304 | 0.3623307 | 0.3039003 |
| N.gravidanze | mean | 0.995098 | 0.9245283 | 1.026379 | -0.0705697 | 0.0312809 | 0.1018506 |
| Gestazione | mean | 38.944853 | 38.9693396 | 39.023981 | 0.0244867 | 0.0791279 | 0.0546412 |
| Peso | mean | 3270.265931 | 3270.5070755 | 3311.708633 | 0.2411441 | 41.4427017 | 41.2015576 |
| Lunghezza | mean | 494.120098 | 495.3396226 | 494.604316 | 1.2195246 | 0.4842185 | -0.7353061 |
| Cranio | mean | 339.927696 | 339.8856132 | 340.274580 | -0.0420829 | 0.3468843 | 0.3889671 |
kable(results_list[["Sesso"]], caption = "Aggregated stats by Sesso")
| Variable | Stat | F | M | diff_M_minus_F |
|---|---|---|---|---|
| Anni.madre | mean | 28.1370518 | 28.235720 | 0.0986682 |
| N.gravidanze | mean | 0.9498008 | 1.013677 | 0.0638758 |
| Gestazione | mean | 38.7314741 | 39.230089 | 0.4986144 |
| Peso | mean | 3161.0613546 | 3408.495575 | 247.4342206 |
| Lunghezza | mean | 489.7641434 | 499.674980 | 9.9108365 |
| Cranio | mean | 337.6231076 | 342.458568 | 4.8354604 |
#Function to explore differences beetwen groups for each numerical variable
data$Fumatrici <- factor(data$Fumatrici, levels = c(0, 1), labels = c("NonFumatrici", "Fumatrici"))
categorical_for_test <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")
test_group_diff <- function(data, num_var, cat_var) {
x <- data[[num_var]]
g <- data[[cat_var]]
df <- data.frame(x = x, g = g) %>% filter(!is.na(x), !is.na(g))
n_groups <- length(unique(df$g))
#Normality test for each group
normality <- by(df$x, df$g, function(v) shapiro.test(v)$p.value)
normal <- all(unlist(normality) > 0.05)
#Omogeneity variance test (Levene)
levene_p <- car::leveneTest(x ~ g, data = df)$"Pr(>F)"[1]
variance_homogeneous <- (levene_p > 0.05)
if (n_groups == 2) {
if (normal && variance_homogeneous) {
test <- t.test(x ~ g, data = df, var.equal = TRUE)
test_name <- "t-test (equal variance)"
nparam_test_name <- NA_character_
nparam_pvalue <- NA_real_
} else {
test <- t.test(x ~ g, data = df, var.equal = FALSE)
test_name <- "Welch t-test (different variance)"
nparam_test_name <- "Wilcoxon rank-sum test"
wtest <- wilcox.test(x ~ g, data = df)
nparam_pvalue <- wtest$p.value
}
} else if (n_groups > 2) {
if (normal && variance_homogeneous) {
test <- aov(x ~ g, data = df)
test_name <- "ANOVA (equal variance)"
nparam_test_name <- NA_character_
nparam_pvalue <- NA_real_
} else {
test <- oneway.test(x ~ g, data = df, var.equal = FALSE)
test_name <- "Welch ANOVA (different variance)"
nparam_test_name <- "Kruskal-Wallis test"
kwtest <- kruskal.test(x ~ g, data = df)
nparam_pvalue <- kwtest$p.value
}
} else {
return(NA)
}
if (test_name == "ANOVA (equal variance)") {
param_pvalue <- summary(test)[[1]][["Pr(>F)"]][1]
} else {
param_pvalue <- test$p.value
}
return(list(
test = test_name,
nparam_test = nparam_test_name,
param_pvalue = param_pvalue,
nparam_pvalue = nparam_pvalue
))
}
#Return of test results
results_tests <- expand.grid(num_var = numerical_vars, cat_var = categorical_vars, stringsAsFactors = FALSE)
results_tests <- results_tests %>%
rowwise() %>%
mutate(test_result = list(test_group_diff(data, num_var, cat_var))) %>%
mutate(
test = test_result$test,
param_pvalue = test_result$param_pvalue,
param_significance = case_when(
param_pvalue < 0.001 ~ "***",
param_pvalue < 0.01 ~ "**",
param_pvalue < 0.05 ~ "*",
TRUE ~ ""
),
nparam_test = test_result$nparam_test,
nparam_pvalue = test_result$nparam_pvalue,
nparam_significance = case_when(
nparam_pvalue < 0.001 ~ "***",
nparam_pvalue < 0.01 ~ "**",
nparam_pvalue < 0.05 ~ "*",
TRUE ~ ""
),
p_value_concordance = ifelse((param_pvalue < 0.05 & nparam_pvalue < 0.05) |
(param_pvalue >= 0.05 & nparam_pvalue >= 0.05), "P-value concordance", "P-value discordance")
) %>%
ungroup() %>%
dplyr::select(num_var, cat_var, test, param_pvalue, param_significance, nparam_test, nparam_pvalue, nparam_significance, p_value_concordance)
#Print of results
kable(
results_tests,
caption = "Statistical test results for differences beetwen groups",
col.names = c("Numerical variable", "Categorical variable", "Parametric test", "P-value pametric test", "Significance parametric test", "Not parametric test", "P-value not parametric test", "Significance not parametric test", "P-Value concordance")
)
| Numerical variable | Categorical variable | Parametric test | P-value pametric test | Significance parametric test | Not parametric test | P-value not parametric test | Significance not parametric test | P-Value concordance |
|---|---|---|---|---|---|---|---|---|
| Anni.madre | Fumatrici | Welch t-test (different variance) | 0.8016453 | Wilcoxon rank-sum test | 0.8716883 | P-value concordance | ||
| N.gravidanze | Fumatrici | Welch t-test (different variance) | 0.0188038 | * | Wilcoxon rank-sum test | 0.0022427 | ** | P-value concordance |
| Gestazione | Fumatrici | Welch t-test (different variance) | 0.0388965 | * | Wilcoxon rank-sum test | 0.2098755 | P-value discordance | |
| Peso | Fumatrici | Welch t-test (different variance) | 0.3022785 | Wilcoxon rank-sum test | 0.0592839 | P-value concordance | ||
| Lunghezza | Fumatrici | Welch t-test (different variance) | 0.2100567 | Wilcoxon rank-sum test | 0.0373755 | * | P-value discordance | |
| Cranio | Fumatrici | Welch t-test (different variance) | 0.6149153 | Wilcoxon rank-sum test | 0.3392392 | P-value concordance | ||
| Anni.madre | Tipo.parto | Welch t-test (different variance) | 0.9240376 | Wilcoxon rank-sum test | 0.9481223 | P-value concordance | ||
| N.gravidanze | Tipo.parto | Welch t-test (different variance) | 0.2881163 | Wilcoxon rank-sum test | 0.1196188 | P-value concordance | ||
| Gestazione | Tipo.parto | Welch t-test (different variance) | 0.4343654 | Wilcoxon rank-sum test | 0.9322109 | P-value concordance | ||
| Peso | Tipo.parto | Welch t-test (different variance) | 0.8916349 | Wilcoxon rank-sum test | 0.5243601 | P-value concordance | ||
| Lunghezza | Tipo.parto | Welch t-test (different variance) | 0.0326031 | * | Wilcoxon rank-sum test | 0.0879934 | P-value discordance | |
| Cranio | Tipo.parto | Welch t-test (different variance) | 0.9728416 | Wilcoxon rank-sum test | 0.6432892 | P-value concordance | ||
| Anni.madre | Ospedale | Welch ANOVA (different variance) | 0.3091713 | Kruskal-Wallis test | 0.2952401 | P-value concordance | ||
| N.gravidanze | Ospedale | Welch ANOVA (different variance) | 0.2472083 | Kruskal-Wallis test | 0.0661108 | P-value concordance | ||
| Gestazione | Ospedale | Welch ANOVA (different variance) | 0.6924804 | Kruskal-Wallis test | 0.2977338 | P-value concordance | ||
| Peso | Ospedale | Welch ANOVA (different variance) | 0.1749530 | Kruskal-Wallis test | 0.2164421 | P-value concordance | ||
| Lunghezza | Ospedale | Welch ANOVA (different variance) | 0.6356001 | Kruskal-Wallis test | 0.7351307 | P-value concordance | ||
| Cranio | Ospedale | Welch ANOVA (different variance) | 0.8647781 | Kruskal-Wallis test | 0.9283076 | P-value concordance | ||
| Anni.madre | Sesso | Welch t-test (different variance) | 0.6366427 | Wilcoxon rank-sum test | 0.7286129 | P-value concordance | ||
| N.gravidanze | Sesso | Welch t-test (different variance) | 0.2129632 | Wilcoxon rank-sum test | 0.2907222 | P-value concordance | ||
| Gestazione | Sesso | Welch t-test (different variance) | 0.0000000 | *** | Wilcoxon rank-sum test | 0.0000000 | *** | P-value concordance |
| Peso | Sesso | Welch t-test (different variance) | 0.0000000 | *** | Wilcoxon rank-sum test | 0.0000000 | *** | P-value concordance |
| Lunghezza | Sesso | Welch t-test (different variance) | 0.0000000 | *** | Wilcoxon rank-sum test | 0.0000000 | *** | P-value concordance |
| Cranio | Sesso | Welch t-test (different variance) | 0.0000000 | *** | Wilcoxon rank-sum test | 0.0000000 | *** | P-value concordance |
The objective of the analysis is to identify significant differences between groups in numerical variables by selecting appropriate statistical tests based on the data distributions. Although normality tests indicate deviations, many variables show approximately normal distributions with slight skewness and outliers. For this reason, standard parametric tests are used when normality and homogeneity of variances assumptions are met, supplemented by robust tests (Welch) in cases of heteroscedasticity, and by non-parametric tests (Wilcoxon and Kruskal-Wallis) when parametric assumptions are violated.
The combined application of these methods allows for reliable detection of significant differences between groups, such as the effect of maternal smoking on the number of pregnancies and newborn size, and differences related to sex on gestational duration and anthropometric characteristics.
Specifically, as explicitly requested, it is verified that anthropometric measurements differ significantly between the two sexes. This finding is confirmed by both parametric and non-parametric tests.
These results will be further illustrated in the following boxplots.
#Function for boxplot
box_plot_grouped <- function(data, num_var, group_var, main_title = NULL) {
df <- data %>%
dplyr::select(all_of(c(num_var, group_var))) %>%
filter(!is.na(.data[[num_var]]), !is.na(.data[[group_var]]))
if (is.null(main_title)) {
main_title <- paste("Boxplot of", num_var, "by", group_var)
}
ggplot(df, aes_string(x = group_var, y = num_var, fill = group_var)) +
geom_boxplot() +
labs(title = main_title, x = group_var, y = num_var) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
}
categorical_vars <- c("Fumatrici", "Tipo.parto", "Ospedale", "Sesso")
#Boxplot by group of categorical variables
plots_by_group <- lapply(categorical_vars, function(cat_var) {
p_list <- lapply(numerical_vars, function(num_var) {
box_plot_grouped(data, num_var, cat_var)
})
wrap_plots(p_list, ncol = 2)
})
names(plots_by_group) <- categorical_vars
#Plot of each boxplot
print(plots_by_group[["Fumatrici"]])
print(plots_by_group[["Tipo.parto"]])
print(plots_by_group[["Ospedale"]])
print(plots_by_group[["Sesso"]])
The visual analysis through boxplots supports the statistical results, highlighting significant differences in gestational weeks and newborn length between smokers and non-smokers, as well as pronounced differences by newborn sex in gestational age, weight, length, and head circumference. No significant differences emerge for delivery type or hospital.
As a further analysis, a chi-squared test of independence is conducted to verify whether the distribution of delivery type (natural or cesarean) significantly differs among the three hospitals in the dataset. This test allows the evaluation of whether an association exists between the healthcare facility and the mode of delivery.
calc_tab_cont <- function(var1, var2) {
cont_tab <- table(var1, var2)
return(cont_tab)
}
cont_tab <- calc_tab_cont(data$Tipo.parto, data$Ospedale)
kable(cont_tab, caption = "Contingency table for variables Tipo.parto and Ospedale")
| osp1 | osp2 | osp3 | |
|---|---|---|---|
| Ces | 242 | 254 | 232 |
| Nat | 574 | 594 | 602 |
result_t1 <- chisq.test(cont_tab)
test_ces_df <- data.frame(
Method = result_t1$method,
Statistic = round(as.numeric(result_t1$statistic), 4),
P_Value = round(result_t1$p.value, 4)
)
kable(test_ces_df, caption = "Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale")
| Method | Statistic | P_Value |
|---|---|---|
| Pearson’s Chi-squared test | 1.083 | 0.5819 |
The statistical analysis using the chi-squared test on the contingency table between ‘Tipo.parto’ and ‘Ospedale’ revealed a p-value of 0.578, which is above the conventional significance threshold of 0.05. This indicates that there are no significant differences in the distribution of delivery types (natural or cesarean) among the three hospitals considered. Therefore, in the studied sample, the place of delivery does not appear to influence the frequency of cesarean births.
Moreover, a statistical verification is conducted to assess the difference between the sample means of newborn weight and length and their respective reference values (3300 grams for weight and 500 mm for length). These reference values are taken from the 2024 Cedap report of the Italian Ministry of Health and are derived from the data collected in the homonymous information flows that each Health Organization is required to produce monthly. To this end, two one-sample tests are employed: the Student’s t-test, based on the assumption of normality, and the Wilcoxon test, which is non-parametric and more robust to deviations from the normality assumption.
#Function to test Peso and Lunghezza then same population variable
compare_to_population <- function(x, mu, alpha = 0.05) {
x <- na.omit(x)
#T test
t_res <- t.test(x, mu = mu)
#Wilcoxon test
wilcox_res <- wilcox.test(x, mu = mu)
#Summary of results
result <- data.frame(
parametric_test = "One-sample t-test",
p_value_ptest = t_res$p.value,
conclusion_ptest = ifelse(t_res$p.value < alpha,
"Reject H0. Target variable is statistically different from population variable",
"Fail to Reject H0. Target variable is not statistically different from population variable"),
nparametric_test = "Wilcoxon signed-rank test",
p_value_nptest = wilcox_res$p.value,
conclusion_nptest = ifelse(wilcox_res$p.value < alpha,
"Reject H0. Target variable is statistically different from population variable",
"Fail to Reject H0. Target variable is not statistically different from population variable"),
stringsAsFactors = FALSE
)
return(result)
}
#Test for weight and length
weight_ref <- 3300
length_ref <- 500
weight_test <- compare_to_population(data$Peso, mu = weight_ref)
length_test <- compare_to_population(data$Lunghezza, mu = length_ref)
kable(weight_test, caption = "Summary table of results for compare beetwen Peso of sample and weight of population",
col.names = c("Parametric test", "P-value parametric test", "Conclusion parametric test", "Not parametric test", "P-value not parametric test", "Conclusion not parametric test"))
| Parametric test | P-value parametric test | Conclusion parametric test | Not parametric test | P-value not parametric test | Conclusion not parametric test |
|---|---|---|---|---|---|
| One-sample t-test | 0.1324476 | Fail to Reject H0. Target variable is not statistically different from population variable | Wilcoxon signed-rank test | 0.9482417 | Fail to Reject H0. Target variable is not statistically different from population variable |
kable(length_test, caption = "Summary table of results for compare beetwen Lunghezza of sample and length of population",
col.names = c("Parametric test", "P-value parametric test", "Conclusion parametric test", "Not parametric test", "P-value not parametric test", "Conclusion not parametric test"))
| Parametric test | P-value parametric test | Conclusion parametric test | Not parametric test | P-value not parametric test | Conclusion not parametric test |
|---|---|---|---|---|---|
| One-sample t-test | 0 | Reject H0. Target variable is statistically different from population variable | Wilcoxon signed-rank test | 0 | Reject H0. Target variable is statistically different from population variable |
Both parametric and non-parametric tests confirm that the average newborn weight in the sample does not differ significantly from the reference value of 3300 grams, indicating a good correspondence with the reference population. Conversely, for length, both tests detect a significant difference compared to the 500 mm reference, indicating that the average length in the sample deviates from the expected value. The consistency between the tests is supported by the approximate normality of the distributions, which justifies the use of parametric tests and strengthens the validity of the conclusions through the consistent results from the non-parametric tests.
In this phase, correlation analyses are performed among the dataset variables to identify any linear relationships between them. First, the global relationship among the numerical variables in the dataset is analyzed, followed by the relationship between numerical variables within each group defined by the values of the categorical variables.
#Select only numerical variables
num_data <- data[, numeric_vars]
#Correlation matrix for numerical variable of dataset
ggpairs(
num_data,
lower = list(continuous = wrap("smooth_loess", alpha = 0.3, size = 0.5)),
upper = list(continuous = wrap("cor", size = 5)),
diag = list(continuous = wrap("densityDiag"))
) +
theme_classic() +
ggtitle("Correlation matrix between numerical variable of dataset") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Function to plot correlation matrix among groups
plot_grouped_ggpairs <- function(data, cat_var, numeric_vars) {
groups <- unique(data[[cat_var]])
plots_list <- lapply(groups, function(g) {
df_subset <- data[data[[cat_var]] == g, ]
p <- ggpairs(
df_subset,
columns = which(names(data) %in% numeric_vars),
lower = list(continuous = wrap("smooth_loess", alpha = 0.3)),
upper = list(continuous = wrap("cor", size = 5)),
diag = list(continuous = wrap("densityDiag"))
) + ggtitle(paste("Correlation matrix for", cat_var, "=", g))+
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
return(p)
})
names(plots_list) <- groups
return(plots_list)
}
numeric_var_names <- names(data)[numeric_vars]
#Plot of all plot for each categorical variable
plots_all_categorical <- lapply(categorical_vars, function(cat_var) {
plot_grouped_ggpairs(data, cat_var, numeric_var_names)
})
#Plot of correlation matrix for variable 'Fumatrici'
print(plots_all_categorical[[1]])
#Plot of correlation matrix for variable 'Tipo.parto'
print(plots_all_categorical[[2]])
#Plot of correlation matrix for variable 'Ospedale'
print(plots_all_categorical[[3]])
#Plot of correlation matrix for variable 'Sesso'
print(plots_all_categorical[[4]])
The descriptive and correlation analysis highlighted that variables related to gestation and neonatal anthropometric measurements are strongly associated with newborn weight and represent key potential predictors. Categorical variables such as maternal smoking, delivery type, newborn sex, and hospital facility influence these relationships and their inclusion in the predictive model should be evaluated.
In particular, the attenuated effects observed in the subgroup of smoking mothers suggest considering interaction terms or differential effects in the model to better capture the dynamics present in the sample. Furthermore, the strong correlation among some anthropometric measurements requires appropriate handling of multicollinearity.
Therefore, the multiple regression model to be developed should include both relevant continuous and categorical variables, assess possible interactions, and implement suitable strategies to address multicollinearity and outliers, in order to ensure reliable and interpretable estimates for the prediction of newborn weight.
In this phase of the project, a multiple linear regression model is developed to quantify the effect of the available variables on newborn birth weight. To this end, a procedure has been implemented that systematically presents the main results of the model, including:
#Creation of regression model with backward approach
#Function for model results
mod_results <- function(mod) {
summ <- summary(mod)
vif_values <- car::vif(mod)
bp_test <- lmtest::bptest(mod)
dw_test <- lmtest::dwtest(mod)
shapiro_test <- shapiro.test(residuals(mod))
#Coefficients table with p-value
coefs <- as.data.frame(summ$coefficients)
coefs <- round(coefs, 4)
rownames(coefs) <- rownames(summ$coefficients)
colnames(coefs) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
print(kable(coefs, caption = "Coefficients of the Regression Model")%>%
kableExtra::kable_styling(full_width = FALSE, position = "left"))
#VIF table
if (is.matrix(vif_values)) {
vif_tab <- data.frame(
Variable = rownames(vif_values),
GVIF = round(vif_values[,"GVIF"], 3),
GVIF_corr = round(vif_values[,"GVIF^(1/(2*Df))"], 3)
)
} else {
vif_tab <- data.frame(
Variable = names(vif_values),
VIF = round(vif_values, 3)
)
}
print(kable(vif_tab, caption = "Variance Inflation Factor (VIF)")%>%
kableExtra::kable_styling(full_width = FALSE, position = "left"))
#Diagnostic test table
diag_tab <- data.frame(
Test = c("Breusch-Pagan (Heteroscedasticity)",
"Durbin-Watson (Autocorrelation)",
"Shapiro-Wilk (Normality of Residuals)"),
Statistic = c(round(bp_test$statistic, 3),
round(dw_test$statistic, 3),
round(shapiro_test$statistic, 3)),
p_value = c(format.pval(bp_test$p.value, digits = 3, eps = 1e-3),
format.pval(dw_test$p.value, digits = 3, eps = 1e-3),
format.pval(shapiro_test$p.value, digits = 3, eps = 1e-3)),
stringsAsFactors = FALSE
)
print(kable(diag_tab, caption = "Diagnostic Tests Summary")%>%
kableExtra::kable_styling(full_width = FALSE, position = "left"))
#Residuals plots
par(mfrow = c(2,2))
plot(mod)
invisible(list(summary = summ,
vif = vif_values,
bptest = bp_test,
dwtest = dw_test,
shapiro = shapiro_test))
}
#Transform also other categorical variables in factor
data$Sesso <- factor(data$Sesso, levels = c("M", "F"))
data$Tipo.parto <- factor(data$Tipo.parto, levels = c("Nat", "Ces"))
data$Ospedale <- factor(data$Ospedale, levels = c("osp1", "osp2", "osp3"))
#Model1: all variables
mod1 <- lm(Peso ~., data = data)
mod_results(mod1)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6628.5902 | 143.0078 | -46.3512 | 0.0000 |
| Anni.madre | 0.8018 | 1.1467 | 0.6992 | 0.4845 |
| N.gravidanze | 11.3812 | 4.6686 | 2.4378 | 0.0148 |
| FumatriciFumatrici | -30.2741 | 27.5492 | -1.0989 | 0.2719 |
| Gestazione | 32.5773 | 3.8208 | 8.5264 | 0.0000 |
| Lunghezza | 10.2922 | 0.3009 | 34.2070 | 0.0000 |
| Cranio | 10.4722 | 0.4263 | 24.5671 | 0.0000 |
| Tipo.partoCes | -29.6335 | 12.0905 | -2.4510 | 0.0143 |
| Ospedaleosp2 | -11.0912 | 13.4471 | -0.8248 | 0.4096 |
| Ospedaleosp3 | 28.2495 | 13.5054 | 2.0917 | 0.0366 |
| SessoF | -77.5723 | 11.1865 | -6.9344 | 0.0000 |
| Variable | GVIF | GVIF_corr | |
|---|---|---|---|
| Anni.madre | Anni.madre | 1.190 | 1.091 |
| N.gravidanze | N.gravidanze | 1.189 | 1.091 |
| Fumatrici | Fumatrici | 1.007 | 1.004 |
| Gestazione | Gestazione | 1.696 | 1.302 |
| Lunghezza | Lunghezza | 2.087 | 1.445 |
| Cranio | Cranio | 1.631 | 1.277 |
| Tipo.parto | Tipo.parto | 1.004 | 1.002 |
| Ospedale | Ospedale | 1.004 | 1.001 |
| Sesso | Sesso | 1.041 | 1.020 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 93.294 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.952 | 0.114 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.974 | <0.001 |
In the initial full model, which includes all available variables, certain features such as gestational age, length, and head circumference are strongly associated with birth weight, highlighting their key role in neonatal growth.
The number of previous pregnancies shows a positive but moderate effect, suggesting that maternal reproductive history may influence the baby’s weight.
Male sex is associated with a higher birth weight, consistent with established biological knowledge.
Conversely, variables such as the hospital of birth and the type of delivery, although included in the initial model, show an overall weak and non-significant effect. This suggests that local differences between hospital groups or delivery methods may have a limited impact on the overall model.
In this phase, starting from the model containing all variables in the dataset, a backward selection approach is adopted to identify the most parsimonious and adequate multiple regression model. The process unfolds through the following steps:
#Function for ANOVA, AIC and BIC test for models
compare_models <- function(mod_prev, mod_new) {
#Extract terms of two models
terms_prev <- attr(terms(mod_prev), "term.labels")
terms_new <- attr(terms(mod_new), "term.labels")
#Verify if the new model is nested
is_nested <- all(terms_new %in% terms_prev)
#Calculate AIC and BIC
aic_tab <- AIC(mod_prev, mod_new)
bic_tab <- BIC(mod_prev, mod_new)
print(knitr::kable(aic_tab, caption = "Method AIC for best model chose") %>%
kableExtra::kable_styling(full_width = FALSE, position = "left"))
print(knitr::kable(bic_tab, caption = "Method BIC for best model chose") %>%
kableExtra::kable_styling(full_width = FALSE, position = "left"))
if (is_nested) {
cat("Models are nested. Performing ANOVA test.\n")
anova_res <- anova(mod_prev, mod_new)
print(knitr::kable(anova_res, caption = "Test ANOVA results") %>%
kableExtra::kable_styling(full_width = FALSE, position = "left"))
} else {
cat("Models are NOT nested. ANOVA test skipped.\n")
}
}
#Model2: all variables minus Anni.madre (not significant variable)
mod2 <- update(mod1, ~.-Anni.madre)
mod_results(mod2)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6601.3083 | 137.5680 | -47.9858 | 0.0000 |
| N.gravidanze | 12.5833 | 4.3400 | 2.8994 | 0.0038 |
| FumatriciFumatrici | -30.4268 | 27.5455 | -1.1046 | 0.2694 |
| Gestazione | 32.2996 | 3.7997 | 8.5006 | 0.0000 |
| Lunghezza | 10.2916 | 0.3008 | 34.2087 | 0.0000 |
| Cranio | 10.4874 | 0.4257 | 24.6375 | 0.0000 |
| Tipo.partoCes | -29.6654 | 12.0892 | -2.4539 | 0.0142 |
| Ospedaleosp2 | -10.9509 | 13.4442 | -0.8145 | 0.4154 |
| Ospedaleosp3 | 28.5171 | 13.4986 | 2.1126 | 0.0347 |
| SessoF | -77.6452 | 11.1849 | -6.9420 | 0.0000 |
| Variable | GVIF | GVIF_corr | |
|---|---|---|---|
| N.gravidanze | N.gravidanze | 1.028 | 1.014 |
| Fumatrici | Fumatrici | 1.007 | 1.004 |
| Gestazione | Gestazione | 1.677 | 1.295 |
| Lunghezza | Lunghezza | 2.087 | 1.445 |
| Cranio | Cranio | 1.627 | 1.275 |
| Tipo.parto | Tipo.parto | 1.004 | 1.002 |
| Ospedale | Ospedale | 1.003 | 1.001 |
| Sesso | Sesso | 1.041 | 1.020 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 91.153 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.953 | 0.12 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.974 | <0.001 |
compare_models(mod1, mod2)
| df | AIC | |
|---|---|---|
| mod_prev | 12 | 35145.57 |
| mod_new | 11 | 35144.06 |
| df | BIC | |
|---|---|---|
| mod_prev | 12 | 35215.45 |
| mod_new | 11 | 35208.12 |
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 2487 | 186743194 | NA | NA | NA | NA |
| 2488 | 186779904 | -1 | -36710 | 0.4888948 | 0.4844861 |
#Model3: all variables minus Ospedale (not significant variable)
mod3 <- update(mod2, ~.-Ospedale)
mod_results(mod3)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6600.2816 | 137.6162 | -47.9615 | 0.0000 |
| N.gravidanze | 12.9927 | 4.3439 | 2.9910 | 0.0028 |
| FumatriciFumatrici | -31.8823 | 27.5803 | -1.1560 | 0.2478 |
| Gestazione | 32.5970 | 3.8039 | 8.5693 | 0.0000 |
| Lunghezza | 10.2684 | 0.3011 | 34.0985 | 0.0000 |
| Cranio | 10.5015 | 0.4262 | 24.6372 | 0.0000 |
| Tipo.partoCes | -30.4244 | 12.1041 | -2.5136 | 0.0120 |
| SessoF | -78.1031 | 11.1998 | -6.9736 | 0.0000 |
| Variable | VIF | |
|---|---|---|
| N.gravidanze | N.gravidanze | 1.027 |
| Fumatrici | Fumatrici | 1.007 |
| Gestazione | Gestazione | 1.676 |
| Lunghezza | Lunghezza | 2.085 |
| Cranio | Cranio | 1.626 |
| Tipo.parto | Tipo.parto | 1.004 |
| Sesso | Sesso | 1.040 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 90.431 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.954 | 0.124 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.974 | <0.001 |
compare_models(mod2, mod3)
| df | AIC | |
|---|---|---|
| mod_prev | 11 | 35144.06 |
| mod_new | 9 | 35149.33 |
| df | BIC | |
|---|---|---|
| mod_prev | 11 | 35208.12 |
| mod_new | 9 | 35201.73 |
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 2488 | 186779904 | NA | NA | NA | NA |
| 2490 | 187473818 | -2 | -693913.8 | 4.621636 | 0.0099213 |
#Model4: log(Peso) for manage eteroschedasticity shown in residuals plot of mod3
mod4 <- lm(log(Peso) ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
Cranio + Tipo.parto + Sesso, data = data)
mod_results(mod4)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 4.4198 | 0.0434 | 101.8054 | 0.0000 |
| N.gravidanze | 0.0041 | 0.0014 | 3.0097 | 0.0026 |
| FumatriciFumatrici | -0.0065 | 0.0087 | -0.7443 | 0.4568 |
| Gestazione | 0.0186 | 0.0012 | 15.5179 | 0.0000 |
| Lunghezza | 0.0035 | 0.0001 | 37.2147 | 0.0000 |
| Cranio | 0.0035 | 0.0001 | 26.1255 | 0.0000 |
| Tipo.partoCes | -0.0070 | 0.0038 | -1.8417 | 0.0656 |
| SessoF | -0.0181 | 0.0035 | -5.1175 | 0.0000 |
| Variable | VIF | |
|---|---|---|
| N.gravidanze | N.gravidanze | 1.027 |
| Fumatrici | Fumatrici | 1.007 |
| Gestazione | Gestazione | 1.676 |
| Lunghezza | Lunghezza | 2.085 |
| Cranio | Cranio | 1.626 |
| Tipo.parto | Tipo.parto | 1.004 |
| Sesso | Sesso | 1.040 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 169.739 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.945 | 0.0848 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.974 | <0.001 |
compare_models(mod3, mod4)
| df | AIC | |
|---|---|---|
| mod_prev | 9 | 35149.325 |
| mod_new | 9 | -5125.624 |
| df | BIC | |
|---|---|---|
| mod_prev | 9 | 35201.734 |
| mod_new | 9 | -5073.215 |
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| N.gravidanze | 1 | 3.571804e+03 | 3.571804e+03 | 0.0474402 | 0.8275968 |
| Fumatrici | 1 | 2.516883e+05 | 2.516883e+05 | 3.3428881 | 0.0676156 |
| Gestazione | 1 | 2.450098e+08 | 2.450098e+08 | 3254.1845859 | 0.0000000 |
| Lunghezza | 1 | 2.048849e+08 | 2.048849e+08 | 2721.2517384 | 0.0000000 |
| Cranio | 1 | 4.707085e+07 | 4.707085e+07 | 625.1882273 | 0.0000000 |
| Tipo.parto | 1 | 4.810017e+05 | 4.810017e+05 | 6.3885943 | 0.0115470 |
| Sesso | 1 | 3.661486e+06 | 3.661486e+06 | 48.6313237 | 0.0000000 |
| Residuals | 2490 | 1.874738e+08 | 7.529069e+04 | NA | NA |
#Model5: interaction beetwen Fumatrici and Lunghezza + Fumatrici and Gestazione
mod5 <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Lunghezza + Fumatrici * Gestazione + Cranio + Tipo.parto + Sesso, data = data)
mod_results(mod5)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 4.4134 | 0.0437 | 101.0835 | 0.0000 |
| N.gravidanze | 0.0043 | 0.0014 | 3.1212 | 0.0018 |
| FumatriciFumatrici | 0.3322 | 0.2556 | 1.2996 | 0.1939 |
| Lunghezza | 0.0035 | 0.0001 | 36.4149 | 0.0000 |
| Gestazione | 0.0193 | 0.0012 | 15.8743 | 0.0000 |
| Cranio | 0.0035 | 0.0001 | 26.1248 | 0.0000 |
| Tipo.partoCes | -0.0073 | 0.0038 | -1.9213 | 0.0548 |
| SessoF | -0.0183 | 0.0035 | -5.1684 | 0.0000 |
| FumatriciFumatrici:Lunghezza | 0.0011 | 0.0005 | 2.2917 | 0.0220 |
| FumatriciFumatrici:Gestazione | -0.0219 | 0.0070 | -3.1418 | 0.0017 |
| Variable | VIF | |
|---|---|---|
| N.gravidanze | N.gravidanze | 1.028 |
| Fumatrici | Fumatrici | 872.074 |
| Lunghezza | Lunghezza | 2.138 |
| Gestazione | Gestazione | 1.730 |
| Cranio | Cranio | 1.627 |
| Tipo.parto | Tipo.parto | 1.004 |
| Sesso | Sesso | 1.043 |
| Fumatrici:Lunghezza | Fumatrici:Lunghezza | 695.009 |
| Fumatrici:Gestazione | Fumatrici:Gestazione | 1004.897 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 176.993 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.949 | 0.0997 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.975 | <0.001 |
compare_models(mod4, mod5)
| df | AIC | |
|---|---|---|
| mod_prev | 9 | -5125.624 |
| mod_new | 11 | -5132.234 |
| df | BIC | |
|---|---|---|
| mod_prev | 9 | -5073.215 |
| mod_new | 11 | -5068.178 |
Models are NOT nested. ANOVA test skipped.
#Model5 centered: centering of Lunghezza and Gestazione to manage multicollinearity
data$Lunghezza_c <- scale(data$Lunghezza, scale=FALSE)
data$Gestazione_c <- scale(data$Gestazione, scale=FALSE)
mod5_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Gestazione_c + Fumatrici*Lunghezza_c + Cranio + Tipo.parto + Sesso, data=data)
mod_results(mod5_cent)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 6.8967 | 0.0457 | 150.8700 | 0.0000 |
| N.gravidanze | 0.0043 | 0.0014 | 3.1212 | 0.0018 |
| FumatriciFumatrici | 0.0023 | 0.0091 | 0.2512 | 0.8017 |
| Gestazione_c | 0.0193 | 0.0012 | 15.8743 | 0.0000 |
| Lunghezza_c | 0.0035 | 0.0001 | 36.4149 | 0.0000 |
| Cranio | 0.0035 | 0.0001 | 26.1248 | 0.0000 |
| Tipo.partoCes | -0.0073 | 0.0038 | -1.9213 | 0.0548 |
| SessoF | -0.0183 | 0.0035 | -5.1684 | 0.0000 |
| FumatriciFumatrici:Gestazione_c | -0.0219 | 0.0070 | -3.1418 | 0.0017 |
| FumatriciFumatrici:Lunghezza_c | 0.0011 | 0.0005 | 2.2917 | 0.0220 |
| Variable | VIF | |
|---|---|---|
| N.gravidanze | N.gravidanze | 1.028 |
| Fumatrici | Fumatrici | 1.105 |
| Gestazione_c | Gestazione_c | 1.730 |
| Lunghezza_c | Lunghezza_c | 2.138 |
| Cranio | Cranio | 1.627 |
| Tipo.parto | Tipo.parto | 1.004 |
| Sesso | Sesso | 1.043 |
| Fumatrici:Gestazione_c | Fumatrici:Gestazione_c | 1.416 |
| Fumatrici:Lunghezza_c | Fumatrici:Lunghezza_c | 1.387 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 176.993 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.949 | 0.0997 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.975 | <0.001 |
compare_models(mod5, mod5_cent)
| df | AIC | |
|---|---|---|
| mod_prev | 11 | -5132.234 |
| mod_new | 11 | -5132.234 |
| df | BIC | |
|---|---|---|
| mod_prev | 11 | -5068.178 |
| mod_new | 11 | -5068.178 |
Models are NOT nested. ANOVA test skipped.
#Model6: interaction beetwen Gestazione and Lunghezza
mod6_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici * Gestazione_c + Fumatrici*Lunghezza_c + Cranio + Sesso + Gestazione_c*Lunghezza_c, data=data)
mod_results(mod6_cent)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 6.9523 | 0.0448 | 155.3155 | 0.0000 |
| N.gravidanze | 0.0038 | 0.0013 | 2.8658 | 0.0042 |
| FumatriciFumatrici | -0.0002 | 0.0089 | -0.0178 | 0.9858 |
| Gestazione_c | 0.0132 | 0.0013 | 10.1521 | 0.0000 |
| Lunghezza_c | 0.0033 | 0.0001 | 34.2855 | 0.0000 |
| Cranio | 0.0034 | 0.0001 | 25.5879 | 0.0000 |
| SessoF | -0.0222 | 0.0035 | -6.4110 | 0.0000 |
| FumatriciFumatrici:Gestazione_c | -0.0186 | 0.0068 | -2.7279 | 0.0064 |
| FumatriciFumatrici:Lunghezza_c | 0.0012 | 0.0005 | 2.6452 | 0.0082 |
| Gestazione_c:Lunghezza_c | -0.0002 | 0.0000 | -11.7191 | 0.0000 |
| Variable | VIF | |
|---|---|---|
| N.gravidanze | N.gravidanze | 1.028 |
| Fumatrici | Fumatrici | 1.106 |
| Gestazione_c | Gestazione_c | 2.070 |
| Lunghezza_c | Lunghezza_c | 2.220 |
| Cranio | Cranio | 1.641 |
| Sesso | Sesso | 1.053 |
| Fumatrici:Gestazione_c | Fumatrici:Gestazione_c | 1.417 |
| Fumatrici:Lunghezza_c | Fumatrici:Lunghezza_c | 1.388 |
| Gestazione_c:Lunghezza_c | Gestazione_c:Lunghezza_c | 1.734 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 147.794 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.946 | 0.0879 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.975 | <0.001 |
compare_models(mod5_cent, mod6_cent)
| df | AIC | |
|---|---|---|
| mod_prev | 11 | -5132.234 |
| mod_new | 11 | -5262.748 |
| df | BIC | |
|---|---|---|
| mod_prev | 11 | -5068.178 |
| mod_new | 11 | -5198.693 |
Models are NOT nested. ANOVA test skipped.
#Model7: add of non linear term of Gestazione and eliminate interaction beetwen Gestazione and Lunghezza (for reduce multicollinearity)
data$Gestazione_c2 <- data$Gestazione_c^2
data$Lunghezza_c2 <- data$Lunghezza_c^2
mod7_cent <- lm(log(Peso) ~ N.gravidanze + Fumatrici*Gestazione_c + Gestazione_c2 +
Fumatrici*Lunghezza_c + Lunghezza_c +
Cranio + Sesso, data=data)
mod_results(mod7_cent)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 6.9544 | 0.0447 | 155.4749 | 0.0000 |
| N.gravidanze | 0.0041 | 0.0013 | 3.0477 | 0.0023 |
| FumatriciFumatrici | -0.0011 | 0.0089 | -0.1238 | 0.9015 |
| Gestazione_c | 0.0117 | 0.0013 | 8.7372 | 0.0000 |
| Gestazione_c2 | -0.0024 | 0.0002 | -11.9303 | 0.0000 |
| Lunghezza_c | 0.0033 | 0.0001 | 35.1399 | 0.0000 |
| Cranio | 0.0034 | 0.0001 | 25.6199 | 0.0000 |
| SessoF | -0.0216 | 0.0034 | -6.2728 | 0.0000 |
| FumatriciFumatrici:Gestazione_c | -0.0152 | 0.0068 | -2.2305 | 0.0258 |
| FumatriciFumatrici:Lunghezza_c | 0.0011 | 0.0005 | 2.4925 | 0.0127 |
| Variable | VIF | |
|---|---|---|
| N.gravidanze | N.gravidanze | 1.028 |
| Fumatrici | Fumatrici | 1.106 |
| Gestazione_c | Gestazione_c | 2.228 |
| Gestazione_c2 | Gestazione_c2 | 1.837 |
| Lunghezza_c | Lunghezza_c | 2.183 |
| Cranio | Cranio | 1.641 |
| Sesso | Sesso | 1.051 |
| Fumatrici:Gestazione_c | Fumatrici:Gestazione_c | 1.424 |
| Fumatrici:Lunghezza_c | Fumatrici:Lunghezza_c | 1.387 |
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 158.233 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.951 | 0.111 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.978 | <0.001 |
compare_models(mod6_cent, mod7_cent)
| df | AIC | |
|---|---|---|
| mod_prev | 11 | -5262.748 |
| mod_new | 11 | -5267.497 |
| df | BIC | |
|---|---|---|
| mod_prev | 11 | -5198.693 |
| mod_new | 11 | -5203.442 |
Models are NOT nested. ANOVA test skipped.
#Resume of best model selection with AIC and BIC
knitr::kable(AIC(mod1, mod2, mod3, mod4, mod5, mod5_cent, mod6_cent, mod7_cent), caption = "Resume of results for method AIC for best model")
| df | AIC | |
|---|---|---|
| mod1 | 12 | 35145.571 |
| mod2 | 11 | 35144.062 |
| mod3 | 9 | 35149.325 |
| mod4 | 9 | -5125.624 |
| mod5 | 11 | -5132.234 |
| mod5_cent | 11 | -5132.234 |
| mod6_cent | 11 | -5262.748 |
| mod7_cent | 11 | -5267.497 |
knitr::kable(BIC(mod1, mod2, mod3, mod4, mod5, mod5_cent, mod6_cent, mod7_cent), caption = "Resume of results for method BIC for best model")
| df | BIC | |
|---|---|---|
| mod1 | 12 | 35215.450 |
| mod2 | 11 | 35208.118 |
| mod3 | 9 | 35201.734 |
| mod4 | 9 | -5073.215 |
| mod5 | 11 | -5068.178 |
| mod5_cent | 11 | -5068.178 |
| mod6_cent | 11 | -5198.693 |
| mod7_cent | 11 | -5203.442 |
The final model demonstrates a good balance between simplicity and predictive performance, as highlighted by the AIC, BIC criteria, and coefficient tests. Diagnostic analysis verifies the fundamental assumptions of linear regression and indicates:
The most relevant variables for predicting neonatal weight include gestational duration, which exhibits a non-linear effect with a more marked increase in the early stages followed by stabilization, head circumference, male sex, and the number of previous pregnancies. Additionally, maternal smoking interacts complexly with gestational age and newborn length, highlighting the importance of considering such interactions in the model.
This summary provides a robust and interpretable model that faithfully describes the data dynamics and lays solid foundations for further considerations.
To ensure the model’s robustness and reliability, an analysis is conducted to identify outliers, high-leverage points, and influential observations that could disproportionately impact the estimates. This step is particularly important in light of the heteroscedasticity revealed by the preliminary diagnostics.
The investigation relies on established metrics such as leverage, studentized residuals, and Cook’s distance, complemented by graphical visualizations and tables of the most relevant observations. This approach allows distinguishing between anomalies or errors and biologically plausible but rare values, thus supporting informed decisions on potential model refinements or further analyses.
#Function to diagnosis of outlier and leverage points
diagnostic_plots_ggplot <- function(mod) {
n <- length(residuals(mod))
p <- length(coefficients(mod))
data_diag <- data.frame(
Index = 1:n,
Leverage = hatvalues(mod),
RStudent = rstudent(mod),
CooksDistance = cooks.distance(mod)
)
#Cutoff for influential points
leverage_cutoff <- 2 * p / n
rstudent_cutoff <- 2
cook_cutoff <- 4 / (n - p)
#Flag influential points
data_diag <- data_diag %>%
mutate(
HighLeverage = Leverage > leverage_cutoff,
Outlier = abs(RStudent) > rstudent_cutoff,
Influential = CooksDistance > cook_cutoff
)
#Function to general plot
plot_diag <- function(df, yvar, ylab, cutoff, flagvar) {
ggplot(df, aes(x = Index, y = !!sym(yvar))) +
geom_point(aes(color = !!sym(flagvar))) +
geom_line() +
geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
labs(title = paste(ylab, "with Influential Points Highlighted"),
y = ylab, x = "Index") +
theme_classic()
}
#Plots
p1 <- plot_diag(data_diag, "Leverage", "Leverage (hat values)", leverage_cutoff, "HighLeverage")
p2 <- plot_diag(data_diag, "RStudent", "Residuals student", rstudent_cutoff, "Outlier")
p3 <- plot_diag(data_diag, "CooksDistance", "Cook's distance", cook_cutoff, "Influential")
#Influential points table
influential_points <- data_diag %>%
filter(HighLeverage | Outlier | Influential) %>%
arrange(desc(CooksDistance)) %>%
dplyr::select(Index, Leverage, RStudent, CooksDistance, HighLeverage, Outlier, Influential)
#table_kable <- knitr::kable(influential_points, caption = "Table of influential points")
#Return plots
return(list(LeveragePlot = p1, ResidualsPlot = p2, CookPlot = p3, InfluentialPoints = influential_points))
}
res <- diagnostic_plots_ggplot(mod7_cent)
#Plots of outlier, leverage and Cook's distance
print(res$LeveragePlot)
print(res$ResidualsPlot)
print(res$CookPlot)
#Extract of influential data
influential_indices <- res$InfluentialPoints$Index
influential_data <- data[influential_indices, 1:10]
influential_data$Index <- influential_indices
#Influential points of dataset
diagnostics_info <- res$InfluentialPoints %>%
dplyr::select(Index, HighLeverage, Outlier, Influential)
#Built of join
final_influential <- inner_join(influential_data, diagnostics_info, by = "Index")
kable(final_influential, caption = "Data and diagnosis of influential points") %>%
kable_styling(full_width = FALSE, position = "left")
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | Index | HighLeverage | Outlier | Influential |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | 1 | NonFumatrici | 38 | 4370 | 315 | 374 | Nat | osp3 | F | 1549 | TRUE | TRUE | TRUE |
| 25 | 2 | NonFumatrici | 25 | 900 | 325 | 253 | Nat | osp3 | F | 1778 | TRUE | TRUE | TRUE |
| 24 | 4 | NonFumatrici | 29 | 1280 | 390 | 355 | Nat | osp1 | F | 1427 | TRUE | TRUE | TRUE |
| 26 | 0 | Fumatrici | 39 | 4930 | 550 | 350 | Ces | osp2 | F | 1918 | TRUE | FALSE | TRUE |
| 40 | 3 | NonFumatrici | 28 | 1560 | 420 | 379 | Nat | osp3 | F | 310 | TRUE | TRUE | TRUE |
| 28 | 0 | NonFumatrici | 26 | 930 | 345 | 245 | Ces | osp3 | F | 2450 | TRUE | FALSE | TRUE |
| 30 | 0 | NonFumatrici | 36 | 3610 | 410 | 330 | Nat | osp1 | M | 155 | TRUE | TRUE | TRUE |
| 30 | 1 | NonFumatrici | 36 | 1280 | 385 | 292 | Nat | osp2 | F | 1426 | TRUE | TRUE | TRUE |
| 32 | 0 | NonFumatrici | 27 | 1140 | 370 | 267 | Nat | osp3 | F | 2118 | TRUE | FALSE | TRUE |
| 35 | 1 | NonFumatrici | 32 | 1890 | 500 | 309 | Nat | osp2 | F | 2113 | TRUE | TRUE | TRUE |
| 28 | 1 | NonFumatrici | 27 | 980 | 320 | 265 | Nat | osp1 | M | 2435 | TRUE | FALSE | TRUE |
| 32 | 4 | Fumatrici | 39 | 2250 | 460 | 319 | Nat | osp2 | F | 1424 | TRUE | FALSE | TRUE |
| 31 | 0 | NonFumatrici | 31 | 990 | 340 | 278 | Ces | osp2 | F | 1617 | TRUE | TRUE | TRUE |
| 34 | 1 | Fumatrici | 40 | 3180 | 470 | 326 | Nat | osp1 | F | 1396 | TRUE | FALSE | TRUE |
| 41 | 3 | NonFumatrici | 35 | 1500 | 420 | 304 | Nat | osp1 | M | 1591 | FALSE | TRUE | TRUE |
| 27 | 1 | NonFumatrici | 38 | 3240 | 410 | 359 | Ces | osp1 | F | 2038 | TRUE | TRUE | TRUE |
| 31 | 3 | Fumatrici | 37 | 3440 | 460 | 362 | Ces | osp1 | M | 2335 | TRUE | FALSE | TRUE |
| 27 | 0 | NonFumatrici | 35 | 3140 | 465 | 290 | Nat | osp2 | F | 2223 | FALSE | TRUE | TRUE |
| 28 | 3 | Fumatrici | 41 | 2940 | 465 | 328 | Nat | osp2 | F | 2351 | TRUE | FALSE | TRUE |
| 30 | 4 | NonFumatrici | 35 | 4520 | 520 | 360 | Nat | osp2 | F | 1551 | TRUE | TRUE | TRUE |
| 26 | 0 | Fumatrici | 40 | 3400 | 525 | 350 | Nat | osp2 | M | 391 | TRUE | FALSE | TRUE |
| 22 | 1 | Fumatrici | 40 | 3350 | 520 | 355 | Nat | osp3 | M | 99 | TRUE | FALSE | TRUE |
| 25 | 1 | Fumatrici | 38 | 3280 | 475 | 324 | Nat | osp1 | F | 567 | TRUE | FALSE | TRUE |
| 29 | 1 | Fumatrici | 41 | 4420 | 530 | 362 | Ces | osp2 | F | 140 | TRUE | FALSE | TRUE |
| 33 | 1 | Fumatrici | 38 | 3030 | 500 | 353 | Nat | osp3 | F | 1624 | TRUE | FALSE | TRUE |
| 18 | 0 | NonFumatrici | 40 | 1850 | 460 | 305 | Nat | osp3 | F | 295 | FALSE | TRUE | TRUE |
| 44 | 0 | Fumatrici | 38 | 3150 | 465 | 335 | Nat | osp3 | F | 335 | TRUE | FALSE | TRUE |
| 33 | 0 | NonFumatrici | 29 | 1620 | 410 | 292 | Nat | osp3 | F | 1383 | TRUE | FALSE | TRUE |
| 20 | 0 | Fumatrici | 42 | 3870 | 530 | 381 | Ces | osp1 | F | 2014 | TRUE | FALSE | TRUE |
| 35 | 6 | Fumatrici | 38 | 2430 | 460 | 324 | Nat | osp2 | F | 442 | TRUE | FALSE | TRUE |
| 30 | 7 | NonFumatrici | 35 | 2220 | 470 | 316 | Nat | osp3 | M | 582 | TRUE | TRUE | TRUE |
| 42 | 2 | NonFumatrici | 38 | 2560 | 525 | 349 | Ces | osp2 | M | 1397 | FALSE | TRUE | TRUE |
| 30 | 3 | NonFumatrici | 38 | 4600 | 485 | 380 | Nat | osp1 | M | 1292 | FALSE | TRUE | TRUE |
| 24 | 0 | NonFumatrici | 35 | 1450 | 405 | 280 | Nat | osp1 | F | 750 | FALSE | TRUE | TRUE |
| 31 | 0 | NonFumatrici | 34 | 1370 | 390 | 287 | Nat | osp2 | F | 101 | FALSE | TRUE | TRUE |
| 33 | 10 | NonFumatrici | 40 | 3090 | 485 | 353 | Nat | osp3 | M | 2420 | TRUE | FALSE | TRUE |
| 36 | 8 | NonFumatrici | 41 | 3730 | 480 | 335 | Nat | osp3 | M | 1448 | TRUE | FALSE | TRUE |
| 28 | 0 | NonFumatrici | 39 | 3800 | 520 | 300 | Nat | osp3 | F | 1710 | FALSE | TRUE | TRUE |
| 28 | 0 | Fumatrici | 41 | 3820 | 500 | 350 | Nat | osp2 | M | 424 | TRUE | FALSE | TRUE |
| 35 | 2 | Fumatrici | 39 | 4030 | 510 | 344 | Ces | osp2 | M | 2410 | TRUE | FALSE | TRUE |
| 23 | 1 | NonFumatrici | 36 | 3850 | 460 | 334 | Ces | osp3 | F | 1692 | FALSE | TRUE | TRUE |
| 32 | 2 | NonFumatrici | 39 | 3430 | 445 | 322 | Ces | osp1 | F | 1633 | FALSE | TRUE | TRUE |
| 31 | 2 | Fumatrici | 38 | 3350 | 510 | 348 | Nat | osp1 | M | 2280 | TRUE | FALSE | TRUE |
| 35 | 10 | NonFumatrici | 39 | 2950 | 495 | 335 | Nat | osp1 | F | 2219 | TRUE | FALSE | TRUE |
| 30 | 1 | NonFumatrici | 39 | 3000 | 475 | 390 | Ces | osp2 | F | 684 | TRUE | TRUE | TRUE |
| 29 | 0 | NonFumatrici | 40 | 3470 | 525 | 390 | Nat | osp1 | M | 1866 | FALSE | TRUE | TRUE |
| 25 | 1 | NonFumatrici | 37 | 1750 | 430 | 305 | Nat | osp3 | F | 322 | FALSE | TRUE | TRUE |
| 30 | 2 | NonFumatrici | 39 | 3790 | 505 | 304 | Ces | osp3 | M | 1393 | FALSE | TRUE | TRUE |
| 22 | 0 | NonFumatrici | 32 | 2340 | 445 | 304 | Nat | osp1 | F | 1356 | TRUE | TRUE | TRUE |
| 27 | 4 | Fumatrici | 41 | 3350 | 485 | 335 | Ces | osp3 | M | 1261 | TRUE | FALSE | TRUE |
| 32 | 4 | Fumatrici | 39 | 3400 | 510 | 356 | Nat | osp1 | M | 1284 | TRUE | FALSE | TRUE |
| 21 | 1 | Fumatrici | 40 | 2590 | 480 | 328 | Ces | osp1 | F | 413 | TRUE | FALSE | TRUE |
| 33 | 0 | NonFumatrici | 36 | 3110 | 465 | 300 | Ces | osp3 | M | 1854 | FALSE | TRUE | TRUE |
| 31 | 0 | NonFumatrici | 40 | 3410 | 550 | 372 | Nat | osp2 | M | 119 | FALSE | TRUE | TRUE |
| 33 | 0 | NonFumatrici | 30 | 1750 | 410 | 294 | Nat | osp2 | M | 2198 | TRUE | FALSE | TRUE |
| 24 | 0 | NonFumatrici | 42 | 2800 | 520 | 340 | Ces | osp2 | M | 2313 | FALSE | TRUE | TRUE |
| 34 | 4 | NonFumatrici | 42 | 2660 | 500 | 320 | Nat | osp2 | F | 1716 | TRUE | TRUE | TRUE |
| 19 | 1 | NonFumatrici | 38 | 1950 | 445 | 305 | Nat | osp2 | M | 1741 | FALSE | TRUE | TRUE |
| 35 | 1 | Fumatrici | 41 | 2910 | 490 | 340 | Nat | osp3 | M | 2108 | TRUE | FALSE | TRUE |
| 29 | 3 | Fumatrici | 41 | 3090 | 495 | 350 | Nat | osp3 | M | 473 | TRUE | FALSE | TRUE |
| 22 | 1 | NonFumatrici | 34 | 3030 | 470 | 312 | Nat | osp2 | F | 1891 | FALSE | TRUE | TRUE |
| 28 | 1 | NonFumatrici | 40 | 3790 | 460 | 332 | Nat | osp2 | F | 1267 | FALSE | TRUE | TRUE |
| 40 | 8 | NonFumatrici | 38 | 3520 | 470 | 341 | Nat | osp3 | M | 516 | TRUE | FALSE | TRUE |
| 28 | 0 | Fumatrici | 40 | 2740 | 500 | 325 | Nat | osp3 | F | 2001 | TRUE | FALSE | TRUE |
| 37 | 2 | Fumatrici | 41 | 3220 | 500 | 357 | Ces | osp3 | M | 1689 | TRUE | FALSE | TRUE |
| 27 | 2 | Fumatrici | 39 | 2920 | 490 | 351 | Ces | osp3 | M | 1523 | TRUE | FALSE | TRUE |
| 27 | 0 | Fumatrici | 42 | 3970 | 515 | 360 | Nat | osp2 | M | 1658 | TRUE | FALSE | TRUE |
| 23 | 0 | NonFumatrici | 41 | 4900 | 510 | 352 | Nat | osp2 | F | 1305 | FALSE | TRUE | TRUE |
| 31 | 4 | Fumatrici | 38 | 3530 | 495 | 335 | Nat | osp1 | F | 1969 | TRUE | FALSE | TRUE |
| 30 | 8 | NonFumatrici | 39 | 2860 | 490 | 337 | Ces | osp2 | F | 1503 | TRUE | FALSE | TRUE |
| 26 | 3 | NonFumatrici | 31 | 1960 | 420 | 300 | Nat | osp2 | F | 1067 | TRUE | FALSE | TRUE |
| 27 | 0 | Fumatrici | 41 | 3160 | 490 | 323 | Nat | osp1 | F | 1026 | TRUE | FALSE | TRUE |
| 23 | 1 | NonFumatrici | 40 | 3520 | 445 | 363 | Nat | osp1 | F | 220 | FALSE | FALSE | TRUE |
| 26 | 1 | NonFumatrici | 30 | 1170 | 370 | 273 | Nat | osp3 | M | 2305 | TRUE | FALSE | TRUE |
| 26 | 8 | NonFumatrici | 40 | 3250 | 500 | 355 | Nat | osp2 | M | 2084 | TRUE | FALSE | TRUE |
| 29 | 1 | NonFumatrici | 41 | 4010 | 470 | 353 | Nat | osp2 | M | 1229 | FALSE | TRUE | TRUE |
| 38 | 4 | NonFumatrici | 38 | 4370 | 530 | 340 | Nat | osp3 | F | 1960 | FALSE | TRUE | TRUE |
| 34 | 10 | NonFumatrici | 38 | 2880 | 470 | 345 | Ces | osp2 | M | 2469 | TRUE | FALSE | TRUE |
| 38 | 0 | NonFumatrici | 40 | 3700 | 470 | 320 | Nat | osp1 | M | 390 | FALSE | TRUE | TRUE |
| 38 | 3 | NonFumatrici | 41 | 2320 | 450 | 330 | Nat | osp2 | F | 656 | FALSE | TRUE | TRUE |
| 29 | 1 | Fumatrici | 40 | 4180 | 520 | 355 | Ces | osp1 | M | 2047 | TRUE | FALSE | TRUE |
| 34 | 2 | NonFumatrici | 33 | 1410 | 380 | 295 | Nat | osp2 | F | 492 | TRUE | FALSE | TRUE |
| 41 | 1 | NonFumatrici | 39 | 2160 | 450 | 325 | Ces | osp3 | M | 1193 | FALSE | TRUE | TRUE |
| 37 | 0 | NonFumatrici | 41 | 2420 | 490 | 300 | Ces | osp1 | M | 1554 | FALSE | TRUE | TRUE |
| 22 | 0 | NonFumatrici | 32 | 1430 | 380 | 301 | Nat | osp1 | M | 1699 | TRUE | FALSE | TRUE |
| 40 | 6 | NonFumatrici | 34 | 1840 | 430 | 305 | Nat | osp1 | F | 1310 | TRUE | FALSE | TRUE |
| 22 | 0 | Fumatrici | 39 | 3290 | 510 | 346 | Nat | osp3 | M | 1761 | TRUE | FALSE | TRUE |
| 32 | 1 | NonFumatrici | 33 | 2040 | 480 | 307 | Ces | osp1 | F | 1272 | FALSE | FALSE | TRUE |
| 26 | 1 | Fumatrici | 38 | 3060 | 490 | 350 | Nat | osp3 | M | 2283 | TRUE | FALSE | TRUE |
| 30 | 2 | Fumatrici | 40 | 3130 | 480 | 320 | Nat | osp1 | M | 1759 | TRUE | FALSE | TRUE |
| 38 | 6 | NonFumatrici | 37 | 3950 | 500 | 350 | Nat | osp3 | M | 134 | FALSE | FALSE | TRUE |
| 27 | 1 | NonFumatrici | 35 | 2100 | 440 | 345 | Ces | osp2 | F | 615 | FALSE | TRUE | TRUE |
| 29 | 0 | NonFumatrici | 42 | 3540 | 540 | 368 | Nat | osp1 | M | 616 | FALSE | TRUE | TRUE |
| 25 | 6 | NonFumatrici | 38 | 2530 | 460 | 340 | Nat | osp1 | F | 2315 | TRUE | FALSE | TRUE |
| 33 | 2 | Fumatrici | 40 | 3320 | 490 | 320 | Nat | osp3 | M | 1650 | TRUE | FALSE | TRUE |
| 32 | 3 | NonFumatrici | 36 | 3780 | 480 | 349 | Nat | osp3 | M | 1340 | FALSE | TRUE | TRUE |
| 30 | 0 | NonFumatrici | 38 | 4540 | 530 | 343 | Ces | osp3 | M | 1539 | FALSE | TRUE | TRUE |
| 22 | 0 | NonFumatrici | 42 | 3100 | 510 | 361 | Nat | osp2 | F | 1586 | FALSE | TRUE | TRUE |
| 31 | 0 | NonFumatrici | 37 | 2040 | 465 | 305 | Nat | osp2 | F | 1508 | FALSE | TRUE | TRUE |
| 35 | 0 | NonFumatrici | 33 | 1390 | 390 | 277 | Nat | osp1 | F | 748 | TRUE | FALSE | TRUE |
| 23 | 1 | Fumatrici | 41 | 3590 | 500 | 349 | Nat | osp3 | F | 105 | TRUE | FALSE | TRUE |
| 38 | 3 | Fumatrici | 40 | 3300 | 510 | 345 | Nat | osp3 | M | 234 | TRUE | FALSE | TRUE |
| 40 | 0 | NonFumatrici | 34 | 1580 | 400 | 300 | Nat | osp2 | F | 2255 | FALSE | FALSE | TRUE |
| 27 | 1 | NonFumatrici | 38 | 2480 | 500 | 315 | Ces | osp2 | M | 1913 | FALSE | TRUE | TRUE |
| 37 | 3 | Fumatrici | 39 | 3150 | 470 | 341 | Nat | osp1 | F | 1395 | TRUE | FALSE | TRUE |
| 32 | 5 | NonFumatrici | 36 | 2400 | 470 | 333 | Nat | osp1 | F | 1008 | FALSE | FALSE | TRUE |
| 35 | 1 | NonFumatrici | 39 | 2660 | 460 | 364 | Ces | osp1 | F | 1400 | FALSE | FALSE | TRUE |
| 37 | 1 | NonFumatrici | 39 | 2500 | 490 | 352 | Ces | osp1 | M | 2217 | FALSE | TRUE | TRUE |
| 29 | 4 | NonFumatrici | 30 | 1340 | 400 | 273 | Ces | osp1 | M | 106 | TRUE | FALSE | TRUE |
| 26 | 1 | NonFumatrici | 30 | 1170 | 370 | 266 | Nat | osp2 | M | 1247 | TRUE | FALSE | TRUE |
| 30 | 1 | NonFumatrici | 41 | 4440 | 510 | 335 | Nat | osp3 | M | 791 | FALSE | TRUE | TRUE |
| 32 | 6 | NonFumatrici | 40 | 4200 | 510 | 350 | Nat | osp1 | M | 828 | FALSE | FALSE | TRUE |
| 35 | 9 | NonFumatrici | 42 | 3760 | 540 | 348 | Nat | osp2 | F | 161 | TRUE | FALSE | TRUE |
| 33 | 3 | NonFumatrici | 40 | 3640 | 470 | 337 | Nat | osp3 | F | 2285 | FALSE | TRUE | TRUE |
| 39 | 2 | NonFumatrici | 35 | 1980 | 450 | 308 | Ces | osp1 | M | 724 | FALSE | TRUE | TRUE |
| 20 | 0 | Fumatrici | 38 | 2990 | 470 | 330 | Nat | osp2 | F | 557 | TRUE | FALSE | TRUE |
| 30 | 0 | NonFumatrici | 40 | 2380 | 480 | 305 | Nat | osp2 | M | 1217 | FALSE | FALSE | TRUE |
| 27 | 1 | NonFumatrici | 38 | 2150 | 450 | 315 | Ces | osp3 | M | 458 | FALSE | TRUE | TRUE |
| 26 | 0 | NonFumatrici | 34 | 2050 | 460 | 325 | Nat | osp1 | F | 1211 | FALSE | FALSE | TRUE |
| 37 | 4 | NonFumatrici | 37 | 3720 | 480 | 355 | Ces | osp1 | F | 1470 | FALSE | FALSE | TRUE |
| 39 | 1 | NonFumatrici | 31 | 1500 | 405 | 295 | Nat | osp3 | M | 206 | TRUE | FALSE | FALSE |
| 25 | 0 | NonFumatrici | 39 | 3290 | 450 | 338 | Nat | osp3 | F | 2056 | FALSE | TRUE | FALSE |
| 30 | 2 | Fumatrici | 37 | 3280 | 500 | 344 | Ces | osp1 | M | 128 | TRUE | FALSE | FALSE |
| 40 | 2 | Fumatrici | 38 | 3440 | 495 | 337 | Nat | osp3 | F | 1900 | TRUE | FALSE | FALSE |
| 27 | 2 | NonFumatrici | 39 | 2260 | 460 | 330 | Nat | osp2 | F | 1486 | FALSE | TRUE | FALSE |
| 37 | 2 | NonFumatrici | 36 | 3710 | 480 | 350 | Nat | osp3 | M | 2130 | FALSE | TRUE | FALSE |
| 18 | 0 | NonFumatrici | 36 | 3050 | 460 | 315 | Nat | osp2 | F | 1556 | FALSE | TRUE | FALSE |
| 23 | 2 | Fumatrici | 40 | 2950 | 495 | 340 | Ces | osp2 | F | 1037 | TRUE | FALSE | FALSE |
| 22 | 1 | Fumatrici | 40 | 3670 | 500 | 340 | Nat | osp1 | M | 120 | TRUE | FALSE | FALSE |
| 22 | 0 | NonFumatrici | 32 | 2580 | 470 | 330 | Nat | osp1 | F | 2214 | TRUE | FALSE | FALSE |
| 33 | 2 | NonFumatrici | 39 | 3660 | 490 | 318 | Nat | osp2 | F | 890 | FALSE | TRUE | FALSE |
| 38 | 1 | NonFumatrici | 40 | 3980 | 480 | 335 | Nat | osp1 | F | 2193 | FALSE | TRUE | FALSE |
| 19 | 0 | NonFumatrici | 36 | 2100 | 445 | 332 | Ces | osp1 | F | 1754 | FALSE | TRUE | FALSE |
| 22 | 0 | Fumatrici | 40 | 4220 | 510 | 375 | Nat | osp1 | M | 1581 | TRUE | FALSE | FALSE |
| 23 | 3 | Fumatrici | 41 | 2620 | 475 | 313 | Nat | osp3 | M | 2218 | TRUE | FALSE | FALSE |
| 30 | 0 | Fumatrici | 38 | 3380 | 490 | 340 | Nat | osp3 | F | 1169 | TRUE | FALSE | FALSE |
| 26 | 1 | NonFumatrici | 32 | 1280 | 360 | 276 | Nat | osp2 | M | 312 | TRUE | FALSE | FALSE |
| 31 | 1 | NonFumatrici | 39 | 3400 | 460 | 325 | Nat | osp1 | M | 377 | FALSE | TRUE | FALSE |
| 30 | 2 | NonFumatrici | 39 | 4240 | 485 | 352 | Nat | osp2 | M | 130 | FALSE | TRUE | FALSE |
| 18 | 1 | Fumatrici | 40 | 3600 | 500 | 335 | Nat | osp3 | M | 296 | TRUE | FALSE | FALSE |
| 35 | 2 | NonFumatrici | 39 | 3300 | 460 | 320 | Ces | osp1 | M | 908 | FALSE | TRUE | FALSE |
| 33 | 3 | NonFumatrici | 39 | 3950 | 480 | 345 | Nat | osp3 | M | 2133 | FALSE | TRUE | FALSE |
| 24 | 1 | NonFumatrici | 40 | 3820 | 500 | 320 | Ces | osp2 | F | 146 | FALSE | TRUE | FALSE |
| 36 | 1 | NonFumatrici | 41 | 2940 | 500 | 358 | Nat | osp3 | M | 1935 | FALSE | TRUE | FALSE |
| 30 | 1 | NonFumatrici | 39 | 2380 | 495 | 325 | Nat | osp1 | F | 1497 | FALSE | TRUE | FALSE |
| 27 | 2 | Fumatrici | 37 | 2850 | 460 | 335 | Nat | osp1 | F | 251 | TRUE | FALSE | FALSE |
| 33 | 0 | NonFumatrici | 37 | 2230 | 455 | 326 | Nat | osp1 | M | 1748 | FALSE | TRUE | FALSE |
| 26 | 0 | Fumatrici | 40 | 3570 | 500 | 335 | Nat | osp2 | M | 1456 | TRUE | FALSE | FALSE |
| 21 | 1 | Fumatrici | 41 | 3270 | 510 | 340 | Nat | osp2 | M | 1904 | TRUE | FALSE | FALSE |
| 21 | 0 | NonFumatrici | 37 | 2750 | 510 | 333 | Nat | osp1 | M | 632 | FALSE | TRUE | FALSE |
| 30 | 0 | Fumatrici | 40 | 3120 | 500 | 344 | Nat | osp3 | M | 984 | TRUE | FALSE | FALSE |
| 27 | 0 | NonFumatrici | 28 | 1285 | 400 | 274 | Nat | osp1 | F | 378 | TRUE | FALSE | FALSE |
| 22 | 2 | NonFumatrici | 38 | 2410 | 470 | 336 | Nat | osp2 | M | 364 | FALSE | TRUE | FALSE |
| 24 | 2 | NonFumatrici | 40 | 2560 | 490 | 323 | Nat | osp1 | M | 1191 | FALSE | TRUE | FALSE |
| 28 | 1 | Fumatrici | 40 | 2850 | 490 | 334 | Nat | osp1 | F | 2143 | TRUE | FALSE | FALSE |
| 35 | 3 | NonFumatrici | 40 | 2660 | 485 | 335 | Ces | osp3 | M | 1286 | FALSE | TRUE | FALSE |
| 35 | 1 | NonFumatrici | 38 | 2850 | 500 | 360 | Nat | osp1 | F | 648 | FALSE | TRUE | FALSE |
| 25 | 2 | NonFumatrici | 41 | 3990 | 495 | 335 | Nat | osp3 | M | 472 | FALSE | TRUE | FALSE |
| 32 | 2 | Fumatrici | 38 | 2920 | 480 | 348 | Nat | osp2 | F | 1051 | TRUE | FALSE | FALSE |
| 28 | 1 | NonFumatrici | 40 | 2410 | 470 | 319 | Nat | osp1 | M | 417 | FALSE | TRUE | FALSE |
| 34 | 0 | NonFumatrici | 39 | 3180 | 470 | 310 | Nat | osp3 | F | 195 | FALSE | TRUE | FALSE |
| 26 | 2 | Fumatrici | 39 | 3530 | 505 | 332 | Ces | osp2 | F | 194 | TRUE | FALSE | FALSE |
| 26 | 0 | NonFumatrici | 40 | 4330 | 500 | 355 | Nat | osp3 | F | 1036 | FALSE | TRUE | FALSE |
| 24 | 0 | NonFumatrici | 38 | 2400 | 460 | 335 | Nat | osp3 | M | 1877 | FALSE | TRUE | FALSE |
| 21 | 0 | NonFumatrici | 41 | 2700 | 500 | 336 | Nat | osp2 | F | 653 | FALSE | TRUE | FALSE |
| 37 | 8 | NonFumatrici | 28 | 930 | 355 | 235 | Nat | osp1 | F | 2173 | TRUE | FALSE | FALSE |
| 30 | 0 | NonFumatrici | 40 | 3860 | 480 | 338 | Nat | osp1 | M | 633 | FALSE | TRUE | FALSE |
| 24 | 1 | Fumatrici | 39 | 3400 | 485 | 344 | Nat | osp1 | F | 242 | TRUE | FALSE | FALSE |
| 25 | 0 | NonFumatrici | 41 | 2980 | 500 | 356 | Nat | osp3 | M | 950 | FALSE | TRUE | FALSE |
| 37 | 3 | NonFumatrici | 33 | 2000 | 470 | 293 | Ces | osp1 | F | 1608 | TRUE | FALSE | FALSE |
| 21 | 0 | NonFumatrici | 39 | 3580 | 485 | 320 | Nat | osp3 | M | 318 | FALSE | TRUE | FALSE |
| 27 | 1 | NonFumatrici | 39 | 4650 | 510 | 354 | Nat | osp2 | M | 2021 | FALSE | TRUE | FALSE |
| 34 | 3 | NonFumatrici | 40 | 4580 | 515 | 360 | Nat | osp2 | M | 1836 | FALSE | TRUE | FALSE |
| 21 | 0 | NonFumatrici | 39 | 3950 | 500 | 330 | Nat | osp1 | F | 1132 | FALSE | TRUE | FALSE |
| 26 | 0 | NonFumatrici | 40 | 3580 | 480 | 326 | Nat | osp1 | F | 418 | FALSE | TRUE | FALSE |
| 17 | 0 | NonFumatrici | 37 | 2050 | 390 | 295 | Nat | osp2 | F | 1014 | TRUE | FALSE | FALSE |
| 35 | 1 | Fumatrici | 39 | 2440 | 450 | 326 | Nat | osp1 | F | 1332 | TRUE | FALSE | FALSE |
| 20 | 0 | NonFumatrici | 38 | 3700 | 480 | 335 | Nat | osp3 | F | 5 | FALSE | TRUE | FALSE |
| 20 | 1 | NonFumatrici | 40 | 2510 | 480 | 322 | Ces | osp2 | M | 1419 | FALSE | TRUE | FALSE |
| 30 | 1 | NonFumatrici | 32 | 2260 | 440 | 322 | Ces | osp3 | F | 592 | TRUE | FALSE | FALSE |
| 30 | 6 | NonFumatrici | 35 | 2680 | 450 | 322 | Nat | osp1 | F | 757 | TRUE | FALSE | FALSE |
| 28 | 1 | NonFumatrici | 39 | 2600 | 480 | 350 | Nat | osp1 | F | 262 | FALSE | TRUE | FALSE |
| 25 | 1 | NonFumatrici | 41 | 2950 | 515 | 333 | Ces | osp2 | M | 460 | FALSE | TRUE | FALSE |
| 24 | 0 | NonFumatrici | 38 | 3920 | 505 | 333 | Nat | osp3 | M | 2010 | FALSE | TRUE | FALSE |
| 25 | 1 | Fumatrici | 42 | 2980 | 490 | 335 | Nat | osp2 | M | 758 | TRUE | FALSE | FALSE |
| 26 | 0 | NonFumatrici | 40 | 2660 | 500 | 343 | Nat | osp1 | F | 2396 | FALSE | TRUE | FALSE |
| 34 | 1 | Fumatrici | 40 | 3310 | 490 | 330 | Nat | osp3 | M | 1118 | TRUE | FALSE | FALSE |
| 28 | 0 | Fumatrici | 39 | 3310 | 500 | 355 | Ces | osp1 | M | 1718 | TRUE | FALSE | FALSE |
| 35 | 3 | Fumatrici | 38 | 2750 | 475 | 332 | Nat | osp1 | F | 1670 | TRUE | FALSE | FALSE |
| 34 | 0 | NonFumatrici | 41 | 3940 | 500 | 340 | Nat | osp1 | F | 623 | FALSE | TRUE | FALSE |
| 24 | 0 | NonFumatrici | 38 | 2920 | 505 | 351 | Ces | osp3 | M | 455 | FALSE | TRUE | FALSE |
| 34 | 1 | NonFumatrici | 40 | 2580 | 495 | 325 | Nat | osp3 | F | 1210 | FALSE | TRUE | FALSE |
| 25 | 6 | NonFumatrici | 33 | 2230 | 430 | 313 | Nat | osp3 | F | 2357 | TRUE | FALSE | FALSE |
| 30 | 8 | NonFumatrici | 40 | 3850 | 518 | 340 | Nat | osp3 | F | 204 | TRUE | FALSE | FALSE |
| 39 | 1 | Fumatrici | 39 | 2770 | 470 | 342 | Nat | osp3 | F | 1574 | TRUE | FALSE | FALSE |
| 27 | 0 | NonFumatrici | 32 | 1550 | 410 | 289 | Nat | osp1 | F | 445 | TRUE | FALSE | FALSE |
| 28 | 0 | NonFumatrici | 38 | 3720 | 500 | 330 | Nat | osp3 | F | 1517 | FALSE | TRUE | FALSE |
| 29 | 0 | Fumatrici | 37 | 2440 | 460 | 315 | Nat | osp1 | F | 658 | TRUE | FALSE | FALSE |
| 27 | 2 | Fumatrici | 39 | 2880 | 490 | 325 | Ces | osp1 | M | 2359 | TRUE | FALSE | FALSE |
| 26 | 0 | NonFumatrici | 40 | 3000 | 520 | 340 | Ces | osp3 | M | 1137 | FALSE | TRUE | FALSE |
| 36 | 0 | NonFumatrici | 31 | 1180 | 355 | 270 | Nat | osp3 | F | 2112 | TRUE | FALSE | FALSE |
| 32 | 1 | Fumatrici | 33 | 1780 | 400 | 305 | Ces | osp1 | F | 2087 | TRUE | FALSE | FALSE |
| 30 | 1 | Fumatrici | 37 | 4050 | 530 | 350 | Nat | osp1 | M | 2243 | TRUE | FALSE | FALSE |
| 35 | 5 | Fumatrici | 40 | 4300 | 530 | 360 | Nat | osp3 | M | 2044 | TRUE | FALSE | FALSE |
| 25 | 0 | NonFumatrici | 40 | 2990 | 515 | 350 | Ces | osp1 | F | 2183 | FALSE | TRUE | FALSE |
| 19 | 0 | Fumatrici | 40 | 3380 | 500 | 361 | Nat | osp2 | M | 1441 | TRUE | FALSE | FALSE |
| 35 | 0 | NonFumatrici | 32 | 1780 | 420 | 277 | Ces | osp1 | F | 1807 | TRUE | FALSE | FALSE |
| 27 | 1 | NonFumatrici | 40 | 3890 | 495 | 330 | Nat | osp2 | M | 2368 | FALSE | TRUE | FALSE |
| 28 | 1 | NonFumatrici | 39 | 3920 | 485 | 345 | Nat | osp2 | F | 850 | FALSE | TRUE | FALSE |
| 25 | 0 | NonFumatrici | 28 | 830 | 310 | 254 | Nat | osp1 | F | 928 | TRUE | FALSE | FALSE |
| 27 | 0 | Fumatrici | 39 | 2990 | 485 | 350 | Nat | osp1 | F | 1931 | TRUE | FALSE | FALSE |
| 31 | 1 | NonFumatrici | 38 | 3920 | 490 | 345 | Nat | osp2 | M | 1942 | FALSE | TRUE | FALSE |
| 25 | 0 | NonFumatrici | 41 | 2210 | 430 | 310 | Nat | osp3 | F | 956 | TRUE | FALSE | FALSE |
| 34 | 0 | NonFumatrici | 39 | 2690 | 495 | 335 | Nat | osp1 | M | 2149 | FALSE | TRUE | FALSE |
| 31 | 1 | NonFumatrici | 38 | 2670 | 500 | 336 | Ces | osp2 | F | 980 | FALSE | TRUE | FALSE |
| 25 | 0 | Fumatrici | 40 | 3860 | 520 | 345 | Nat | osp1 | M | 703 | TRUE | FALSE | FALSE |
| 39 | 3 | NonFumatrici | 30 | 1300 | 380 | 276 | Nat | osp1 | M | 2147 | TRUE | FALSE | FALSE |
| 21 | 0 | NonFumatrici | 40 | 4050 | 500 | 340 | Nat | osp3 | M | 1583 | FALSE | TRUE | FALSE |
| 31 | 1 | Fumatrici | 40 | 3310 | 490 | 334 | Ces | osp2 | M | 2457 | TRUE | FALSE | FALSE |
| 29 | 1 | NonFumatrici | 39 | 3800 | 490 | 333 | Ces | osp3 | M | 90 | FALSE | TRUE | FALSE |
| 30 | 2 | NonFumatrici | 29 | 1190 | 360 | 272 | Nat | osp2 | F | 805 | TRUE | FALSE | FALSE |
| 30 | 1 | Fumatrici | 39 | 3100 | 485 | 325 | Nat | osp3 | F | 2200 | TRUE | FALSE | FALSE |
| 26 | 1 | NonFumatrici | 39 | 2850 | 500 | 347 | Ces | osp1 | M | 2422 | FALSE | TRUE | FALSE |
| 25 | 1 | NonFumatrici | 39 | 2720 | 495 | 345 | Nat | osp3 | F | 361 | FALSE | TRUE | FALSE |
| 23 | 1 | NonFumatrici | 39 | 4100 | 505 | 350 | Nat | osp1 | F | 1206 | FALSE | TRUE | FALSE |
| 34 | 1 | Fumatrici | 39 | 2910 | 490 | 325 | Nat | osp2 | M | 932 | TRUE | FALSE | FALSE |
| 31 | 1 | Fumatrici | 40 | 2850 | 485 | 330 | Nat | osp1 | F | 1110 | TRUE | FALSE | FALSE |
| 38 | 12 | NonFumatrici | 39 | 3350 | 490 | 344 | Nat | osp2 | M | 1218 | TRUE | FALSE | FALSE |
| 24 | 3 | Fumatrici | 40 | 3020 | 490 | 338 | Nat | osp2 | F | 1325 | TRUE | FALSE | FALSE |
| 36 | 8 | NonFumatrici | 36 | 2860 | 460 | 334 | Nat | osp2 | F | 1725 | TRUE | FALSE | FALSE |
| 37 | 2 | NonFumatrici | 31 | 1690 | 405 | 290 | Nat | osp2 | M | 2406 | TRUE | FALSE | FALSE |
| 19 | 0 | Fumatrici | 39 | 2710 | 470 | 331 | Nat | osp3 | F | 321 | TRUE | FALSE | FALSE |
| 16 | 1 | NonFumatrici | 31 | 1900 | 440 | 300 | Nat | osp2 | F | 587 | TRUE | FALSE | FALSE |
| 33 | 11 | NonFumatrici | 43 | 3400 | 475 | 360 | Nat | osp1 | M | 1130 | TRUE | FALSE | FALSE |
| 21 | 1 | Fumatrici | 40 | 3550 | 510 | 335 | Ces | osp3 | M | 1413 | TRUE | FALSE | FALSE |
| 32 | 6 | NonFumatrici | 39 | 3290 | 480 | 360 | Ces | osp2 | M | 1409 | TRUE | FALSE | FALSE |
| 27 | 0 | NonFumatrici | 31 | 1800 | 430 | 308 | Ces | osp3 | M | 1684 | TRUE | FALSE | FALSE |
| 23 | 0 | Fumatrici | 39 | 3100 | 500 | 325 | Nat | osp2 | M | 1367 | TRUE | FALSE | FALSE |
| 27 | 1 | Fumatrici | 37 | 3110 | 485 | 348 | Nat | osp2 | F | 182 | TRUE | FALSE | FALSE |
| 32 | 0 | Fumatrici | 40 | 3510 | 500 | 346 | Nat | osp2 | M | 1281 | TRUE | FALSE | FALSE |
| 30 | 1 | NonFumatrici | 33 | 1770 | 410 | 275 | Nat | osp3 | M | 1091 | TRUE | FALSE | FALSE |
| 36 | 6 | NonFumatrici | 40 | 4340 | 515 | 383 | Nat | osp1 | M | 1320 | TRUE | FALSE | FALSE |
| 20 | 0 | NonFumatrici | 41 | 2280 | 450 | 280 | Ces | osp3 | M | 151 | TRUE | FALSE | FALSE |
| 25 | 0 | Fumatrici | 41 | 3640 | 510 | 350 | Nat | osp3 | M | 572 | TRUE | FALSE | FALSE |
| 27 | 0 | Fumatrici | 40 | 3870 | 530 | 340 | Nat | osp1 | M | 668 | TRUE | FALSE | FALSE |
| 24 | 1 | Fumatrici | 39 | 2830 | 485 | 315 | Ces | osp1 | M | 1017 | TRUE | FALSE | FALSE |
| 26 | 1 | Fumatrici | 37 | 3140 | 490 | 335 | Nat | osp2 | F | 2236 | TRUE | FALSE | FALSE |
| 36 | 4 | Fumatrici | 40 | 3480 | 495 | 347 | Nat | osp1 | M | 1226 | TRUE | FALSE | FALSE |
| 33 | 6 | NonFumatrici | 37 | 2900 | 465 | 350 | Ces | osp2 | F | 2146 | TRUE | FALSE | FALSE |
| 36 | 2 | Fumatrici | 40 | 3040 | 470 | 345 | Nat | osp1 | M | 2463 | TRUE | FALSE | FALSE |
| 28 | 2 | Fumatrici | 41 | 3860 | 520 | 353 | Nat | osp2 | M | 1174 | TRUE | FALSE | FALSE |
| 32 | 1 | Fumatrici | 40 | 3170 | 500 | 330 | Ces | osp2 | M | 593 | TRUE | FALSE | FALSE |
| 31 | 0 | NonFumatrici | 31 | 1730 | 430 | 300 | Nat | osp3 | F | 2456 | TRUE | FALSE | FALSE |
| 33 | 0 | Fumatrici | 40 | 2940 | 470 | 352 | Nat | osp2 | F | 1408 | TRUE | FALSE | FALSE |
| 31 | 1 | Fumatrici | 39 | 4020 | 530 | 356 | Nat | osp3 | F | 1535 | TRUE | FALSE | FALSE |
| 24 | 5 | NonFumatrici | 42 | 3600 | 510 | 335 | Ces | osp3 | M | 985 | TRUE | FALSE | FALSE |
| 28 | 1 | Fumatrici | 40 | 3270 | 480 | 363 | Nat | osp1 | M | 538 | TRUE | FALSE | FALSE |
| 29 | 1 | Fumatrici | 39 | 3110 | 500 | 327 | Nat | osp1 | F | 2235 | TRUE | FALSE | FALSE |
| 35 | 9 | NonFumatrici | 37 | 3150 | 490 | 335 | Nat | osp2 | M | 1779 | TRUE | FALSE | FALSE |
| 33 | 1 | Fumatrici | 40 | 3860 | 525 | 346 | Ces | osp1 | M | 2448 | TRUE | FALSE | FALSE |
| 25 | 1 | Fumatrici | 37 | 2740 | 475 | 321 | Nat | osp2 | F | 1787 | TRUE | FALSE | FALSE |
| 21 | 0 | Fumatrici | 40 | 3050 | 495 | 322 | Ces | osp3 | F | 1510 | TRUE | FALSE | FALSE |
| 38 | 2 | Fumatrici | 38 | 2970 | 465 | 349 | Nat | osp3 | M | 1478 | TRUE | FALSE | FALSE |
| 19 | 0 | Fumatrici | 40 | 3320 | 500 | 335 | Nat | osp1 | M | 279 | TRUE | FALSE | FALSE |
| 24 | 1 | Fumatrici | 38 | 2850 | 465 | 340 | Ces | osp2 | F | 1471 | TRUE | FALSE | FALSE |
| 23 | 0 | Fumatrici | 38 | 3310 | 495 | 340 | Ces | osp3 | M | 306 | TRUE | FALSE | FALSE |
| 34 | 3 | NonFumatrici | 32 | 1615 | 390 | 297 | Nat | osp3 | F | 947 | TRUE | FALSE | FALSE |
| 20 | 0 | Fumatrici | 39 | 2920 | 490 | 320 | Nat | osp2 | F | 699 | TRUE | FALSE | FALSE |
| 36 | 8 | NonFumatrici | 39 | 3610 | 500 | 351 | Ces | osp1 | M | 89 | TRUE | FALSE | FALSE |
| 24 | 0 | Fumatrici | 40 | 3130 | 490 | 342 | Ces | osp2 | F | 2254 | TRUE | FALSE | FALSE |
| 33 | 1 | Fumatrici | 37 | 2570 | 460 | 320 | Nat | osp2 | F | 1378 | TRUE | FALSE | FALSE |
| 32 | 1 | Fumatrici | 37 | 3040 | 490 | 330 | Nat | osp2 | F | 1447 | TRUE | FALSE | FALSE |
| 25 | 3 | Fumatrici | 40 | 3250 | 495 | 334 | Nat | osp1 | M | 2268 | TRUE | FALSE | FALSE |
| 26 | 1 | Fumatrici | 39 | 3080 | 485 | 340 | Nat | osp2 | F | 1270 | TRUE | FALSE | FALSE |
| 20 | 0 | Fumatrici | 36 | 2650 | 470 | 312 | Ces | osp3 | M | 1927 | TRUE | FALSE | FALSE |
| 21 | 1 | Fumatrici | 39 | 3250 | 500 | 336 | Nat | osp2 | F | 1423 | TRUE | FALSE | FALSE |
| 34 | 2 | Fumatrici | 40 | 3380 | 485 | 361 | Nat | osp2 | M | 2099 | TRUE | FALSE | FALSE |
#Plot for diagnosis of detected values
plot_diagnosis <- function(data, xvar, yvar = "Peso", groupvar = NULL) {
p <- ggplot(data, aes_string(x = xvar, y = yvar)) +
geom_point(alpha = 0.5) +
geom_point(data=filter(data, HighLeverage), color="red", size=3) +
geom_point(data=filter(data, Outlier), shape=1, color="blue", size=3) +
geom_point(data=filter(data, Influential), shape=4, color="black", size=3) +
geom_smooth(method = "loess")+
theme_classic()
if (!is.null(groupvar)) {
p <- p + facet_wrap(as.formula(paste0("~", groupvar)))
}
p <- p +
theme_minimal() +
labs(
title = paste("Influential points highlited on", yvar, "vs", xvar),
subtitle = "Red: HighLeverage, Blue: Outlier, Black: Influential",
x = xvar,
y = yvar
)
return(p)
}
vars_to_plot <- c("Gestazione", "Lunghezza", "N.gravidanze", "Cranio")
plots_list <- lapply(vars_to_plot, function(var) {
# Add grouping only if variables is Gestazione or Lunghezza
if (var %in% c("Gestazione", "Lunghezza")) {
group_var <- "Fumatrici"
} else {
group_var <- "Sesso"
}
plot_diagnosis(final_influential, xvar = var, yvar = "Peso", groupvar = group_var)
})
#Plots for each regressor
print(plots_list[[1]])
print(plots_list[[2]])
print(plots_list[[3]])
print(plots_list[[4]])
#Plot of residuals vs fitted
df_resid <- data.frame(
Fitted = fitted(mod7_cent),
Residuals = resid(mod7_cent),
StdResid = rstandard(mod7_cent),
Index = 1:length(resid(mod7_cent))
)
#Residuals vs fitted
ggplot(df_resid, aes(x = Fitted, y = Residuals)) +
geom_point(alpha=0.5) +
geom_hline(yintercept = 0, linetype="dashed") +
geom_smooth(method = "loess", se=FALSE, color="red") +
labs(title = "Residuals vs predicted values", x = "Predicted values", y = "Residuals") +
theme_classic()
#Calculation of R^2 and RMSE of the best model
R_squared <- summary(mod7_cent)$r.squared
R_squared <- round(R_squared, 4)
#Calculation of RMSE of the best model
predicted <- predict(mod7_cent)
actual <- log(data$Peso)
RMSE <- sqrt(mean((actual - predicted)^2))
RMSE <- round(RMSE, 4)
#Table of R^2 and RMSE of the best model
df_res <- data.frame(R_squared, RMSE)
knitr::kable(df_res, caption = "Table of R^2 and RMSE of the best model")
| R_squared | RMSE |
|---|---|
| 0.7903 | 0.0839 |
Key observations from the diagnostic evaluation of the final model:
Outliers and influential points: the analysis of leverage, studentized residuals, and Cook’s distance identifies few cases with potential significant influence. These observations have extreme characteristics (e.g., multiple pregnancies or body measurements outside the common range) but are biologically plausible and not necessarily data errors.
Variable interactions: some influential observations are linked to specific combinations of key variables such as gestational age, length, and maternal smoking, reflecting important interactions in the model. Considering the overall good quality and interpretability, no modifications or exclusions are performed.
Residual analysis: plots of residuals versus fitted values show a generally homogeneous distribution without systematic patterns, confirming the model’s adequacy. However, some heteroscedasticity remains, which is addressed by the logarithmic transformation of the response variable.
Model adequacy and predictive accuracy: with an R² of about 0.79, the model explains a large portion of the variability in the log-transformed birth weight. The RMSE of about 0.084 indicates good predictive accuracy and limited discrepancy between observed and predicted values, confirming the model’s validity and reliability.
Overall impact: although influential points are identified, no evidence arises that compromises the overall validity or predictive power. The model balances parsimony and clarity in presenting relevant effects and adequately captures data variability. For this reason and to preserve these characteristics, it has been decided not to further explore robust model management methods.
This section demonstrates how to use the model to make predictions of the newborn’s weight based on the model parameters, highlighting how the model performs inference to find the best estimate for the target variable.
#Prediction with some example values
predict_weight <- function(N_gravidanze, Fumatrici, Gestazione, Lunghezza, Cranio, Sesso, model, mean_gest, mean_lung) {
#Control of factorial levels
if(!Fumatrici %in% levels(model$model$Fumatrici)) {
stop("Invalid value for Fumatrici")
}
if(!Sesso %in% unique(model$model$Sesso)) {
stop("Invalid value for Sesso")
}
#Centering of variable
Gestazione_c <- as.double(Gestazione - mean_gest)
Gestazione_c2 <- as.double(Gestazione_c^2)
Lunghezza_c <- as.double(Lunghezza - mean_lung)
#Creation of matrix 1x1 as the same class of variables in model
Gestazione_c <- matrix(Gestazione_c, nrow = 1, ncol = 1)
Gestazione_c2 <- matrix(Gestazione_c2, nrow = 1, ncol = 1)
Lunghezza_c <- matrix(Lunghezza_c, nrow = 1, ncol = 1)
#New data
newdata <- data.frame(
N.gravidanze = as.numeric(N_gravidanze),
Fumatrici = factor(Fumatrici, levels = levels(model$model$Fumatrici)),
Cranio = as.numeric(Cranio),
Sesso = as.character(Sesso)
)
#Add matrix for variables Gestazione_c, Gestazione_c2 and Lunghezza_c
newdata$Gestazione_c <- I(Gestazione_c)
newdata$Gestazione_c2 <- I(Gestazione_c2)
newdata$Lunghezza_c <- I(Lunghezza_c)
#Logaritmic prediction
log_pred <- predict(model, newdata = newdata)
#Return of transformed prediction
weight_pred <- exp(log_pred)
return(weight_pred)
}
#Mean calculation out of the function for centering variables
mean_gest <- mean(data$Gestazione, na.rm = TRUE)
mean_lung <- mean(data$Lunghezza, na.rm = TRUE)
predicted_weight <- predict_weight(
N_gravidanze = 3,
Fumatrici = "NonFumatrici",
Gestazione = 39,
Lunghezza = 490,
Cranio = 337,
Sesso = "F",
model = mod7_cent,
mean_gest = mean_gest,
mean_lung = mean_lung
)
Predicted birth weight: 3173 grams
The provided function can be used to generate different prediction examples by modifying the variable values in each call. In this case, considering a mother in her third pregnancy, non-smoker, with 39 weeks of gestation and a female newborn measuring 490 mm in length and 337 mm in head circumference, the model estimates a birth weight of approximately 3200 g.
No external validation or formal cross-validation of the predictive model has been conducted. However, these practices represent a fundamental step to evaluate the generalizability and robustness of the model on independent data or in different clinical settings. Their implementation would allow confirmation of the reliability of the estimates in different populations and further enhance confidence in the clinical use of the model.
Future analyses would be even more reliable by including external validation on independent datasets or cross-validation to assess and, if necessary, optimize predictive performance with a view to transferability and broader practical application.
This section presents the main results of the model, interpreted within the relevant clinical context, visualizing:
#Function to plot relationship with regressors and target variable
create_plots <- function(predict_func, model, mean_gest, mean_lung, fixed_vals_list) {
#Create levels for categorical variables
fumatrici_levels <- levels(model$model$Fumatrici)
sesso_levels <- levels(model$model$Sesso)
#Plot 1: Peso vs Gestazione and Fumatrici
gest_seq <- seq(25, 45, 0.5)
df1 <- expand.grid(Gestazione = gest_seq, Fumatrici = fumatrici_levels)
df1$Predicted <- mapply(function(g,f) {
predict_func(N_gravidanze = fixed_vals_list$p1$N_gravidanze,
Fumatrici = f,
Gestazione = g,
Lunghezza = fixed_vals_list$p1$Lunghezza,
Cranio = fixed_vals_list$p1$Cranio,
Sesso = fixed_vals_list$p1$Sesso,
model = model,
mean_gest = mean_gest,
mean_lung = mean_lung)
}, df1$Gestazione, df1$Fumatrici)
p1 <- ggplot(df1, aes(x=Gestazione, y=Predicted, color=Fumatrici)) +
geom_point() +
geom_smooth(method = "loess")+
labs(title="Pred. Weight vs Gestazione and Fumatrici") +
theme_classic()
#Plot 2: Peso vs Lunghezza and Fumatrici
lung_seq <- seq(300, 560, 0.5)
df2 <- expand.grid(Lunghezza = lung_seq, Fumatrici = fumatrici_levels)
df2$Predicted <- mapply(function(l,f) {
pred <- predict_func(N_gravidanze = fixed_vals_list$p2$N_gravidanze,
Fumatrici = f,
Gestazione = fixed_vals_list$p2$Gestazione,
Lunghezza = l,
Cranio = fixed_vals_list$p2$Cranio,
Sesso = fixed_vals_list$p2$Sesso,
model = model,
mean_gest = mean_gest,
mean_lung = mean_lung)
}, df2$Lunghezza, df2$Fumatrici)
p2 <- ggplot(df2, aes(x=Lunghezza, y=Predicted, color=Fumatrici)) +
geom_point() +
geom_smooth(method = "loess")+
labs(title="Pred.Weight vs Lunghezza and Fumatrici") +
theme_classic()
#Plot 3: Peso vs N.Gravidanze and Sesso
ng_seq <- seq(0, 10, 1)
df3 <- expand.grid(N_gravidanze = ng_seq, Sesso = sesso_levels)
df3$Predicted <- mapply(function(ng,s) {
predict_func(N_gravidanze = ng,
Fumatrici = fixed_vals_list$p3$Fumatrici,
Gestazione = fixed_vals_list$p3$Gestazione,
Lunghezza = fixed_vals_list$p3$Lunghezza,
Cranio = fixed_vals_list$p3$Cranio,
Sesso = s,
model = model,
mean_gest = mean_gest,
mean_lung = mean_lung)
}, df3$N_gravidanze, df3$Sesso)
p3 <- ggplot(df3, aes(x=N_gravidanze, y=Predicted, color=Sesso)) +
geom_point() +
geom_smooth(method = "loess")+
labs(title="Pred. Weight vs N.gravidanze and Sesso") +
theme_classic()
#Plot 4: Peso vs Cranio and Sesso
c_seq <- seq(230, 400, 0.5)
df4 <- expand.grid(Cranio = c_seq, Sesso = sesso_levels)
df4$Predicted <- mapply(function(c,s) {
predict_func(N_gravidanze = fixed_vals_list$p4$N_gravidanze,
Fumatrici = fixed_vals_list$p4$Fumatrici,
Gestazione = fixed_vals_list$p4$Gestazione,
Lunghezza = fixed_vals_list$p4$Lunghezza,
Cranio = c,
Sesso = s,
model = model,
mean_gest = mean_gest,
mean_lung = mean_lung)
}, df4$Cranio, df4$Sesso)
p4 <- ggplot(df4, aes(x=Cranio, y=Predicted, color=Sesso)) +
geom_point() +
geom_smooth(method = "loess")+
labs(title="Pred. Weight vs Cranio and Sesso") +
theme_classic()
return(list(p1, p2, p3, p4))
}
#Creation of fixed vals list
fixed_vals_list <- list(
p1 = list(N_gravidanze=3, Lunghezza=490, Cranio=337, Sesso="F"),
p2 = list(N_gravidanze=3, Gestazione=39, Cranio=337, Sesso="F"),
p3 = list(Fumatrici="NonFumatrici", Gestazione=39, Lunghezza=490, Cranio=337),
p4 = list(N_gravidanze=3, Fumatrici="NonFumatrici", Gestazione=39, Lunghezza=490)
)
#Execution and visualization
plots <- create_plots(predict_weight, mod7_cent, mean_gest, mean_lung, fixed_vals_list)
#Creation of wrap plots
wrap_plots(plots, ncol=2, nrow=2)
The results obtained from the predictive model clearly highlight the most influential maternal and biometric variables affecting newborn weight, providing useful tools for early evaluation of the risk of impaired growth.
Impact of gestation and smoking:
Gestational age is the primary factor determining the progressive increase in fetal weight. However, the distinction between smoking and non-smoking mothers reveals a clear weight gap, with infants of smoking mothers being significantly smaller. This finding confirms the need for targeted preventive actions aimed at smoking cessation during pregnancy to reduce the incidence of intrauterine growth restriction. Moreover, for high-risk cases (e.g., smoking mothers with advanced gestation but low predicted weight), it is essential to schedule closer monitoring and consider early availability of neonatal intensive care.
Role of biometric measurements:
Variables such as length and head circumference represent reliable indicators of overall fetal growth. The linear or near-linear correlations with weight suggest that, when integrated into routine ultrasounds, they can be used to develop personalized weight estimates at term. Such information allows identification of infants potentially at risk of underdevelopment or macrosomia, enabling optimization of clinical decisions regarding delivery and neonatal care.
Influence of maternal factors and sex:
The number of previous pregnancies shows a positive impact on weight, likely due to progressive maternal physiological adaptations. The differentiation by fetal sex, with males generally heavier, underscores the need to consider these characteristics when calculating personalized growth curves. These findings support a differentiated risk assessment for male and female newborns, aiding specific care protocols.
Clinical and operational implications:
#Plot of predicted vs real values of target variable
#Predicted values on original scale (exp of log prediction)
predicted_values <- exp(predict(mod7_cent))
#Create dataframe with index and both observed and predicted weights
df_long <- data.frame(
Index = 1:nrow(data),
Observed = data$Peso,
Predicted = predicted_values
) %>%
pivot_longer(cols = c("Observed", "Predicted"),
names_to = "Type", values_to = "Weight")
#Plot of Weight vs Index, colored by Type (Observed or Predicted)
ggplot(df_long, aes(x = Index, y = Weight, color = Type)) +
geom_point(alpha = 0.5) +
scale_color_manual(
values = c("Observed" = "blue", "Predicted" = "red"),
labels = c("Observed", "Predicted")
) +
labs(
title = "Comparison of Observed and Predicted Newborn Weight",
x = "Observation Index",
y = "Weight (grams)",
color = "Data Type"
) +
theme_classic()
The plot shows a good overall correspondence between the observed values (in blue) and the predicted values (in red) from the model across the entire series of observations. It can be seen that the model explains well the variability of most newborn weights within the central range of the distribution.
For extreme values, both at the lower and upper ends of the weight distribution, a greater discrepancy between observed and predicted values is observed. This phenomenon is consistent with the presence of heteroscedasticity highlighted by diagnostic tests, that is, a variation in the dispersion of prediction errors that increases with the expected value.
Moreover, the presence of biologically plausible outliers (such as very small newborns due to prematurity or very large newborns in cases of macrosomia or multiple pregnancies) makes it more difficult for the model to provide precise estimates in these extreme ranges. These cases represent natural clinical variability and not measurement errors, and their influence contributes to greater uncertainty in the predictions.
The practical implications to consider are as follows:
#Function to plot residuals of model vs groups of categorical variables of model
std_res_plot <- function(data, x, y) {
ggplot(data, aes(x = {{x}}, y = {{y}}, fill = {{x}})) +
geom_boxplot() +
labs(
title = paste("Standardized Residuals by", deparse(substitute(x))),
x = deparse(substitute(x)),
y = deparse(substitute(y))
) +
theme_classic()
}
#Add residuals of model
data$residuals_std <- rstandard(mod7_cent)
#Plot of results
p1 <- std_res_plot(data, Fumatrici, residuals_std)
p2 <- std_res_plot(data, Sesso, residuals_std)
p1 + p2 + plot_layout(ncol = 2)
The analysis of residuals stratified by groups confirms that the model does not introduce systematic bias related to the considered categorical variables (smoking status and sex). The distribution of prediction errors is similar across groups, with medians close to zero and comparable spread of residual variability. The presence of some outliers, evenly distributed among categories, indicates that difficult-to-predict cases are not associated with specific groups but rather reflect natural biological variability. These results support what has been previously discussed, namely that the model performs well on most data and that larger discrepancies mainly occur in correspondence with rare and anomalous but biologically plausible values.
The analysis conducted with the predictive model allows for a reliable estimation of neonatal weight in most cases, providing an objective support for the early assessment of the risk of altered fetal growth. The model highlights the importance of maternal and biometric variables, such as gestational age, smoking status, infant sex, length, and head circumference, in determining the estimated weight.
The residual analysis stratified by smoking status and sex groups showed no systematic bias, confirming that the model maintains good predictive equity across these categories. However, greater uncertainty emerged in the estimation for extreme weight values, i.e., in rare but biologically plausible cases. This uncertainty requires the qualified intervention of clinical expertise, which must validate and integrate the estimates especially in cases of very low or very high predicted weight, or in the presence of significant maternal risk factors.
At the operational level, particular attention is recommended for:
In terms of model improvement, the systematic collection of longitudinal fetal growth data and the inclusion of additional relevant clinical variables could increase the predictive accuracy and robustness of the model. The collection of repeated biometric parameters over time, as well as more detailed information on maternal conditions, would facilitate the development of dynamic and personalized models, with a potential positive impact on prenatal care pathways.