The project develops a statistical model to predict newborn birth weight using data from three hospitals. The goal is to improve high-risk pregnancy care, optimize resources, and reduce neonatal complications through timely intervention and risk factor identification.
The dataset contains continuous numeric variables (birth weight, length, head circumference), discrete numeric variables (maternal age, pregnancies, gestational age), and nominal qualitative variables (maternal smoking, delivery type, hospital, infant sex). Birth weight is the target variable.
#Load dataset
data <- read.csv("C:/Users/matmi/OneDrive/Desktop/ProgettiR/neonati.csv")
attach(data)
#Factorize qualitative variables
data$Fumatrici <- factor(data$Fumatrici, levels = c(0,1), labels = c("NonFumatrici", "Fumatrici"))
data$Tipo.parto <- factor(data$Tipo.parto, levels = c("Nat", "Ces"))
data$Ospedale <- factor(data$Ospedale, levels = c("osp1", "osp2", "osp3"))
data$Sesso <- factor(data$Sesso, levels = c("M", "F"))
In this phase the variables of the dataset are analyzed based on their typology.
The distributions are consistent with clinical expectations and show similar asymmetries between anthropometric variables. This suggests correlation of biological data.
#Continue quantitative variables of dataset
cont_vars <- c("Peso", "Lunghezza", "Cranio")
#Function for statistical index calculation
calc_stats <- function(x) {
c(
mean = round(mean(x, na.rm = TRUE), 2),
median = round(median(x, na.rm = TRUE), 2),
sd = round(sd(x, na.rm = TRUE), 2),
skew = round(skewness(x, na.rm = TRUE), 2),
min = round(min(x, na.rm = TRUE), 2),
max = round(max(x, na.rm = TRUE), 2)
)
}
print_stat_summary <- function(data, vars, caption) {
summary_mat <- sapply(data[, vars], calc_stats)
kable(summary_mat, caption = caption)
}
print_stat_summary(data, cont_vars, "Matrix of statistical index for continuous quantitative variables")
| Peso | Lunghezza | Cranio | |
|---|---|---|---|
| mean | 3284.08 | 494.69 | 340.03 |
| median | 3300.00 | 500.00 | 340.00 |
| sd | 525.04 | 26.32 | 16.43 |
| skew | -0.65 | -1.51 | -0.79 |
| min | 830.00 | 310.00 | 235.00 |
| max | 4930.00 | 565.00 | 390.00 |
The target variable (Weight) is tested for normality. The Shapiro-Wilk test rejects normality, as expected in large samples. The Q-Q plot shows approximate symmetry. Since linear regression requires normality of the residuals, which will be assessed during modeling, it is considered appropriate to approach this modeling.
#Investigate normality for target variable (Peso)
#Shapiro-Wilk test
shapiro_test <- shapiro.test(Peso)
shapiro_df <- data.frame(
Test = "Shapiro-Wilk",
Statistic = round(shapiro_test$statistic, 2),
P_Value = round(shapiro_test$p.value, 3),
Conclusion = ifelse(shapiro_test$p.value > 0.05,
"Fail to reject H0 (normality)",
"Reject H0 (not normal)")
)
#Shapiro-Wilk test results for target variable
kable(shapiro_df, caption = "Results of Shapiro-Wilk test on target variable", row.names = TRUE)
| Test | Statistic | P_Value | Conclusion | |
|---|---|---|---|---|
| W | Shapiro-Wilk | 0.97 | 0 | Reject H0 (not normal) |
#Q-Q plot
ggplot(data, aes(sample = Peso)) +
stat_qq(color = "blue") + stat_qq_line(color = "red") +
labs(title = "Q-Q plot of birth weight (Peso)",
x = "Theoretical quantiles",
y = "Sample quantiles") +
theme_classic()
Outliers with maternal age close to zero are observed; these are considered errors and removed during the cleaning phase. The distributions reflect the expected clinical characteristics: maternal age around 30 years, pregnancies skewed toward low values, and gestational age with a left tail for premature births.
#Discrete quantitative variables of dataset
disc_vars <- c("Anni.madre", "N.gravidanze", "Gestazione")
#Histogram plots
plot_multi_hist <- function(data, vars) {
plots <- lapply(vars, function(v) {
ggplot(data, aes_string(x = v)) +
geom_bar(fill = "blue", color = "black") +
labs(title = paste("Freq. of", v)) +
theme_classic()
})
gridExtra::grid.arrange(grobs = plots, nrow = 1)
}
plot_multi_hist(data, disc_vars)
The prevalence of smoking is low, and the gender distribution is balanced. Maternal smoking is considered a risk factor, while the hospital and delivery method are variables that are not expected to be clinically relevant and are not analyzed.
#Function to calculate contingency table
desc_qual_df <- function(v) {
tab <- table(data[[v]])
perc <- round(100 * prop.table(tab), 1)
df <- data.frame(
Variable = v,
Level = names(tab),
Count = as.numeric(tab),
Percentage = perc,
stringsAsFactors = FALSE
)
return(
kable(
cbind(df[, 1:3], df[,5]),
caption = paste("Frequency distribution of variable", v),
col.names = c("Variable", "Level", "Count", "Percentage")
) %>%
kable_styling(full_width = FALSE, position = "left")
)
}
#Contingency table results for variable Fumatrici
desc_qual_df("Fumatrici")
| Variable | Level | Count | Percentage |
|---|---|---|---|
| Fumatrici | NonFumatrici | 2396 | 95.8 |
| Fumatrici | Fumatrici | 104 | 4.2 |
#Contingency table results for variable Sesso
desc_qual_df("Sesso")
| Variable | Level | Count | Percentage |
|---|---|---|---|
| Sesso | M | 1244 | 49.8 |
| Sesso | F | 1256 | 50.2 |
In this section, we analyze, through tabular displays, the outliers of the variables with particular clinical significance and with a probable greater influence on the analysis.
#Function to calculate outliers
calc_outliers <- function(x, varname = "variable") {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
which_outliers_low <- which(x < lower_bound)
which_outliers_up <- which(x > upper_bound)
if(length(which_outliers_low) + length(which_outliers_up) == 0) {
df <- data.frame(
Variable = varname,
Number_low_outliers = 0,
Number_high_outliers = 0,
Percent_low_outliers = 0,
Percent_high_outliers = 0,
Conclusion = "No outliers detected",
stringsAsFactors = FALSE
)
} else {
n_outliers_low <- length(which_outliers_low)
n_outliers_up <- length(which_outliers_up)
total <- sum(!is.na(x))
percent_outliers_low <- 100 * n_outliers_low / total
percent_outliers_up <- 100* n_outliers_up / total
conclusion <- ifelse(percent_outliers_low + percent_outliers_up > 5, "High number of outliers", "Acceptable number of outliers")
df <- data.frame(
Variable = varname,
Number_low_outliers = n_outliers_low,
Number_high_outliers = n_outliers_up,
Percent_low_outliers = percent_outliers_low,
Percent_high_outliers = percent_outliers_up,
Conclusion = conclusion,
stringsAsFactors = FALSE
)
}
return(
kable(df, caption = paste("Table of number, percentage and conclusion for outliers of variable", varname)) %>%
kable_styling(full_width = FALSE, position = "left")
)
}
Outliers are evaluated primarily for the target variable (Peso).
#Calculate outliers of Peso
calc_outliers(data[, "Peso"], varname = "Peso")
| Variable | Number_low_outliers | Number_high_outliers | Percent_low_outliers | Percent_high_outliers | Conclusion |
|---|---|---|---|---|---|
| Peso | 55 | 14 | 2.2 | 0.56 | Acceptable number of outliers |
Outliers of anthropometric measurements are checked to verify the concordance of their distributions with the target variable.
#Calculate outliers of Lunghezza
calc_outliers(data[, "Lunghezza"], varname = "Lunghezza")
| Variable | Number_low_outliers | Number_high_outliers | Percent_low_outliers | Percent_high_outliers | Conclusion |
|---|---|---|---|---|---|
| Lunghezza | 56 | 3 | 2.24 | 0.12 | Acceptable number of outliers |
#Calculate outliers of Cranio
calc_outliers(data[, "Cranio"], varname = "Cranio")
| Variable | Number_low_outliers | Number_high_outliers | Percent_low_outliers | Percent_high_outliers | Conclusion |
|---|---|---|---|---|---|
| Cranio | 37 | 11 | 1.48 | 0.44 | Acceptable number of outliers |
Outliers also occur for gestational weeks, potentially clinically influential on weight.
#Calculate outliers of Gestazione
calc_outliers(data[, "Gestazione"], varname = "Gestazione")
| Variable | Number_low_outliers | Number_high_outliers | Percent_low_outliers | Percent_high_outliers | Conclusion |
|---|---|---|---|---|---|
| Gestazione | 67 | 0 | 2.68 | 0 | Acceptable number of outliers |
Similar anomalous trends are found in anthropometric measurements and gestational age, with mostly low values, likely corresponding to premature births. The absence of high anomalous values for gestational age, combined with high measurements, suggests a saturation effect, which should be explored in the modeling.
Maternal age values <10, biologically implausible, are removed without impacting dataset quality.
#Function to find and remove error in variable Anni.Madre
remove_rows_condition <- function(data, var, condition_func) {
rows_to_remove <- which(condition_func(data[[var]]))
data_filtered <- data[-rows_to_remove, ]
return(data_filtered)
}
data <- remove_rows_condition(data, "Anni.madre", function(x) x < 10)
In the second phase of the analysis, specific hypotheses of clinical and statistical relevance are tested to better understand the relationships between qualitative and quantitative variables in the dataset.
The test shows that there is no evidence supporting significant differences in the frequency of cesarean deliveries between hospitals. Therefore, in the predictive modeling of neonatal weight, there is no statistical justification to include the Hospital variable or its indirect effects through Type of delivery.
#Function to calculate contingency table for Tipo.parto and Ospedale
calc_tab_cont <- function(var1, var2) {
cont_tab <- table(var1, var2)
return(cont_tab)
}
cont_tab <- calc_tab_cont(data$Tipo.parto, data$Ospedale)
kable(cont_tab, caption = "Contingency table for variables Tipo.parto and Ospedale")
| osp1 | osp2 | osp3 | |
|---|---|---|---|
| Nat | 574 | 594 | 602 |
| Ces | 242 | 254 | 232 |
#Chi-squared test execution
result_t1 <- chisq.test(cont_tab)
test_ces_df <- data.frame(
Method = result_t1$method,
Statistic = round(as.numeric(result_t1$statistic), 2),
P_Value = round(result_t1$p.value, 3)
)
kable(test_ces_df, caption = "Result of chi-squared test on on the contingency table between Tipo.parto and Ospedale")
| Method | Statistic | P_Value |
|---|---|---|
| Pearson’s Chi-squared test | 1.08 | 0.582 |
Peso and Lunghezza are compared to the 2024 Cedap dataset reference values. Given the continuous nature of the variables, the large sample size, and only minor deviations from normality, is used the one-sample t-test.
#Function to test Peso and Lunghezza then same population variable
compare_to_population <- function(x, mu, alpha = 0.05, varname = "variable") {
x <- na.omit(x)
#T test
t_res <- t.test(x, mu = mu)
#Summary of results
result <- data.frame(
parametric_test = "One-sample t-test",
p_value_ptest = round(t_res$p.value, 3),
conclusion_ptest = ifelse(t_res$p.value < alpha,
"Reject H0. Target variable is statistically different from population variable",
"Fail to Reject H0. Target variable is not statistically different from population variable"),
stringsAsFactors = FALSE
)
return(
kable(result,
caption = paste("Summary table of results for comparison between", varname, "of sample and population value"),
col.names = c("Parametric test", "P-value", "Conclusion")) %>%
kable_styling(full_width = FALSE, position = "left")
)
}
#Test for weight and length
weight_ref <- 3300
length_ref <- 500
compare_to_population(data$Peso, mu = weight_ref, varname = "Peso")
| Parametric test | P-value | Conclusion |
|---|---|---|
| One-sample t-test | 0.132 | Fail to Reject H0. Target variable is not statistically different from population variable |
compare_to_population(data$Lunghezza, mu = length_ref, varname = "Lunghezza")
| Parametric test | P-value | Conclusion |
|---|---|---|
| One-sample t-test | 0 | Reject H0. Target variable is statistically different from population variable |
Weight shows no significant differences from the reference mean (3300 g). Length, however, shows a significant difference from the reference value (500 mm). The discrepancy in length could reflect specific effects emerging from the data set, investigated during the modeling phase.
In this section are tested differences between anthropometric variables among Sesso. Normality of the variables within groups is considered satisfied cause the large size of sample. Homogeneity of variances is tested with Leven test. Since the variables (Peso, Lunghezza, Cranio) are continuous quantitative measures with approximately normal distributions, parametric tests can be used to assess differences between Sex groups.
cat_for_test <- c("Sesso")
num_for_test <- c("Peso", "Lunghezza", "Cranio")
test_group_diff <- function(data, num_var, cat_var) {
x <- data[[num_var]]
g <- data[[cat_var]]
df <- data.frame(x = x, g = g) %>% filter(!is.na(x), !is.na(g))
n_groups <- length(unique(df$g))
#Omogeneity variance test (Levene)
levene_p <- car::leveneTest(x ~ g, data = df)$"Pr(>F)"[1]
variance_homogeneous <- (levene_p > 0.05)
if (n_groups == 2) {
if (variance_homogeneous) {
test <- t.test(x ~ g, data = df, var.equal = TRUE)
test_name <- "t-test (equal variance)"
} else {
test <- t.test(x ~ g, data = df, var.equal = FALSE)
test_name <- "Welch t-test (different variance)"
}
} else {
return(NA)
}
param_pvalue <- test$p.value
return(list(
test = test_name,
param_pvalue = param_pvalue
))
}
#Return of test results
results_tests <- expand.grid(num_var = num_for_test, cat_var = cat_for_test, stringsAsFactors = FALSE)
results_tests <- results_tests %>%
rowwise() %>%
mutate(test_result = list(test_group_diff(data, num_var, cat_var))) %>%
mutate(
test = test_result$test,
param_pvalue = round(test_result$param_pvalue, 3),
param_significance = case_when(
param_pvalue < 0.001 ~ "***",
param_pvalue < 0.01 ~ "**",
param_pvalue < 0.05 ~ "*",
TRUE ~ ""
)) %>%
ungroup() %>%
dplyr::select(num_var, cat_var, test, param_pvalue, param_significance)
#Print of results for Peso, Lunghezza and Cranio among Sesso
kable(
results_tests[1:3,],
caption = "Statistical test results for differences beetwen groups of variable Sesso",
col.names = c("Numerical variable", "Categorical variable", "Parametric test", "P-value pametric test", "Significance parametric test")
)
| Numerical variable | Categorical variable | Parametric test | P-value pametric test | Significance parametric test |
|---|---|---|---|---|
| Peso | Sesso | t-test (equal variance) | 0 | *** |
| Lunghezza | Sesso | Welch t-test (different variance) | 0 | *** |
| Cranio | Sesso | t-test (equal variance) | 0 | *** |
The results show that for all three anthropometric variables analyzed there are statistically significant differences between the groups defined by sex. These results suggest that the variable Sesso is an important discriminative factor to include as a covariate.
This section performs a correlation analysis among the main variables of biological interest.
#Variable for correlation analysis
vars_corr <- c("Peso", "Lunghezza", "Cranio")
#Plot of correlation matrix for anthropometric variables
ggpairs(data, columns = vars_corr,
aes(color = Fumatrici),
lower = list(continuous = wrap("smooth_loess", alpha = 0.3, size = 0.5)),
upper = list(continuous = wrap("cor", size = 5)),
diag = list(continuous = wrap("densityDiag"))) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
ggtitle("Correlation matrix colored by Fumatrici")
Anthropometric measurements are highly correlated, indicating the biological coherence of the sample. Maternal smoking appears to affect length more than the other measurements.
#Function for scatterplot
plot_scatter <- function(data, wrap_var) {
facet_formula <- as.formula(paste("~", wrap_var))
ggplot(data, aes(x = Gestazione, y = Peso)) +
geom_point() +
geom_smooth(method = "loess") +
theme_classic() +
facet_wrap(facet_formula) +
ggtitle(paste("Relationship between Peso and Gestazione by", wrap_var)) +
labs(x = "Gestazione", y = "Peso")
}
#Plot of relationship between Peso and Gestazione among Fumatrici
plot_scatter(data, "Fumatrici")
Clustering by maternal smoking status highlights a possible smoking-associated alteration in fetal growth, which should be taken into account in the multivariate model.
#Plot of relationship between Peso and Gestazione among Sesso
plot_scatter(data, "Sesso")
Gender influences growth rate, with males showing faster weight gain. The relationship between weight and gestational age is not perfectly linear, confirming the saturation effect.
Gestational age, sex, maternal smoking, maternal characteristics, and anthropometric variables will be included as predictors of birth weight for the initial model.
The initial linear regression model includes biologically relevant variables for predicting newborn birth weight, excluding delivery type and hospital variables, which lack biological significance.
#First model with variable biologically significant
mod1 <- lm(Peso ~ Lunghezza + Cranio + Gestazione + Anni.madre + N.gravidanze + Fumatrici + Sesso, data = data)
#Function for regression model coefficients
coef_fun <- function(mod) {
summary_mod <- summary(mod)
coef_tab <- as.data.frame(summary_mod$coefficients)
coef_tab$p_value <- coef_tab$`Pr(>|t|)`
coef_tab$Significance <- with(coef_tab, ifelse(
p_value < 0.001, "***",
ifelse(p_value < 0.01, "**",
ifelse(p_value < 0.05, "*", ""))
))
coef_tab <- coef_tab[, c("Estimate", "Std. Error", "t value", "p_value", "Significance")]
coef_tab[,1:3] <- round(coef_tab[,1:3], 2)
coef_tab$p_value <- round(coef_tab$p_value, 3)
coef_tab_kable <- kable(coef_tab, caption = "Coefficient summary of initial regression model") %>%
kable_styling(full_width = FALSE, position = "left")
return(coef_tab_kable)
}
#Print of regression model coefficients of the first model
coef_fun(mod1)
| Estimate | Std. Error | t value | p_value | Significance | |
|---|---|---|---|---|---|
| (Intercept) | -6634.16 | 143.24 | -46.31 | 0.000 | *** |
| Lunghezza | 10.23 | 0.30 | 33.98 | 0.000 | *** |
| Cranio | 10.52 | 0.43 | 24.63 | 0.000 | *** |
| Gestazione | 32.95 | 3.83 | 8.61 | 0.000 | *** |
| Anni.madre | 0.88 | 1.15 | 0.77 | 0.444 | |
| N.gravidanze | 11.38 | 4.68 | 2.43 | 0.015 |
|
| FumatriciFumatrici | -30.40 | 27.61 | -1.10 | 0.271 | |
| SessoF | -78.08 | 11.21 | -6.96 | 0.000 | *** |
#Function to verify multicollinearity
vif_fun <- function(mod) {
vif_values <- vif(mod)
#VIF table
if (is.matrix(vif_values)) {
vif_tab <- data.frame(
Variable = rownames(vif_values),
GVIF = round(vif_values[,"GVIF"], 2),
GVIF_corr = round(vif_values[,"GVIF^(1/(2*Df))"], 2)
)
} else {
vif_tab <- data.frame(
Variable = names(vif_values),
VIF = round(vif_values, 2)
)
}
vif_kable <- kable(vif_tab, caption = "Variance Inflation Factor (VIF)")%>%
kableExtra::kable_styling(full_width = FALSE, position = "left")
return(vif_kable)
}
In the initial specification, anthropometric measurements and gestational age are strong positive predictors of birth weight, while female newborns have a lower mean birth weight than males. Maternal age and smoking status do not appear to significantly influence the target variable.
#Print of VIF table
vif_fun(mod1)
| Variable | VIF | |
|---|---|---|
| Lunghezza | Lunghezza | 2.08 |
| Cranio | Cranio | 1.63 |
| Gestazione | Gestazione | 1.69 |
| Anni.madre | Anni.madre | 1.19 |
| N.gravidanze | N.gravidanze | 1.19 |
| Fumatrici | Fumatrici | 1.01 |
| Sesso | Sesso | 1.04 |
Given the observed correlation between the anthropometric measurements, multicollinearity is assessed and found not to be problematic based on the VIF values.
Maternal age is removed as it is not statistically significant and the comparison between the models confirms that its exclusion does not worsen the model fit.
#Second model with variable Anni.madre elimination
mod2 <- update(mod1, ~. -Anni.madre)
#Print of regression model coefficients of the second model
coef_fun(mod2)
| Estimate | Std. Error | t value | p_value | Significance | |
|---|---|---|---|---|---|
| (Intercept) | -6604.10 | 137.75 | -47.94 | 0.000 | *** |
| Lunghezza | 10.23 | 0.30 | 33.98 | 0.000 | *** |
| Cranio | 10.54 | 0.43 | 24.71 | 0.000 | *** |
| Gestazione | 32.64 | 3.81 | 8.57 | 0.000 | *** |
| N.gravidanze | 12.70 | 4.35 | 2.92 | 0.004 | ** |
| FumatriciFumatrici | -30.57 | 27.60 | -1.11 | 0.268 | |
| SessoF | -78.16 | 11.21 | -6.97 | 0.000 | *** |
#Function to compare results of nested models
anova_fun <- function(mod_simple, mod_complex) {
terms_simple <- attr(terms(mod_simple), "term.labels")
terms_complex <- attr(terms(mod_complex), "term.labels")
is_nested <- all(terms_simple %in% terms_complex)
if (!is_nested) {
cat("Models are NOT nested. ANOVA test is not possible.\n")
return(NULL)
}
anova_res <- round(anova(mod_simple, mod_complex), 2)
anova_df <- as.data.frame(anova_res) %>%
mutate(Model = c("More simple model", "More complex model")) %>%
dplyr::select(Model, everything())
anova_kable <- kable(anova_df, caption = "Test ANOVA results", col.names = c("Model", "Resid.Df", "Residual Sum Sq", "Df (Diff)", "Sum of Sq (Diff)", "F value", "P-value")) %>%
kableExtra::kable_styling(full_width = FALSE, position = "left")
return(anova_kable)
}
#Print of ANOVA test results
anova_fun(mod2, mod1)
| Model | Resid.Df | Residual Sum Sq | Df (Diff) | Sum of Sq (Diff) | F value | P-value |
|---|---|---|---|---|---|---|
| More simple model | 2491 | 187949505 | NA | NA | NA | NA |
| More complex model | 2490 | 187905214 | 1 | 44291.73 | 0.59 | 0.44 |
Based on the exploratory analysis, a potential interaction between maternal smoking and gestational age is tested. The interaction did not improve model fit according to AIC and BIC criteria and was therefore excluded.
#Model with interaction between Fumatrici and Gestazione
data$Gestazione_c <- scale(data$Gestazione, center = TRUE, scale = FALSE)
mod3 <- lm(Peso ~ Lunghezza + Cranio + N.gravidanze + Fumatrici*Gestazione_c + Sesso, data = data)
#Print of regression model coefficients of the third model
coef_fun(mod3)
| Estimate | Std. Error | t value | p_value | Significance | |
|---|---|---|---|---|---|
| (Intercept) | -5326.80 | 154.98 | -34.37 | 0.000 | *** |
| Lunghezza | 10.23 | 0.30 | 33.96 | 0.000 | *** |
| Cranio | 10.53 | 0.43 | 24.69 | 0.000 | *** |
| N.gravidanze | 12.75 | 4.35 | 2.93 | 0.003 | ** |
| FumatriciFumatrici | -24.70 | 28.12 | -0.88 | 0.380 | |
| Gestazione_c | 33.20 | 3.84 | 8.64 | 0.000 | *** |
| SessoF | -78.74 | 11.22 | -7.02 | 0.000 | *** |
| FumatriciFumatrici:Gestazione_c | -21.05 | 19.28 | -1.09 | 0.275 |
#Function to test with method AIC and BIC
aic_bic_fun <- function(mod_prev, mod_new) {
aic_tab <- round(AIC(mod_prev, mod_new), 2)
bic_tab <- round(BIC(mod_prev, mod_new), 2)
aic_kable <- knitr::kable(aic_tab, caption = "Method AIC for best model chose") %>%
kableExtra::kable_styling(full_width = FALSE, position = "left")
bic_kable <- knitr::kable(bic_tab, caption = "Method BIC for best model chose") %>%
kableExtra::kable_styling(full_width = FALSE, position = "left")
return(list(AIC = aic_kable, BIC = bic_kable))
}
#Print of AIC and BIC test
aicbic_tables <- aic_bic_fun(mod2, mod3)
aicbic_tables$AIC
| df | AIC | |
|---|---|---|
| mod_prev | 8 | 35153.66 |
| mod_new | 9 | 35154.46 |
aicbic_tables$BIC
| df | BIC | |
|---|---|---|
| mod_prev | 8 | 35200.24 |
| mod_new | 9 | 35206.87 |
In line with the exploratory analysis, the interaction between maternal smoking and newborn length is retained, as it is statistically significant and improves model fit without introducing multicollinearity.
#Model with interaction between Fumatrici and Lunghezza
data$Lunghezza_c <- scale(data$Lunghezza, center = TRUE, scale = FALSE)
mod4 <- lm(Peso ~ Gestazione + Cranio + N.gravidanze + Fumatrici*Lunghezza_c + Sesso, data = data)
#Print of regression model coefficients of the fourth model
coef_fun(mod4)
| Estimate | Std. Error | t value | p_value | Significance | |
|---|---|---|---|---|---|
| (Intercept) | -1556.86 | 191.92 | -8.11 | 0.000 | *** |
| Gestazione | 32.98 | 3.81 | 8.66 | 0.000 | *** |
| Cranio | 10.54 | 0.43 | 24.73 | 0.000 | *** |
| N.gravidanze | 12.98 | 4.34 | 2.99 | 0.003 | ** |
| FumatriciFumatrici | -23.13 | 27.76 | -0.83 | 0.405 | |
| Lunghezza_c | 10.14 | 0.30 | 33.42 | 0.000 | *** |
| SessoF | -76.95 | 11.21 | -6.86 | 0.000 | *** |
| FumatriciFumatrici:Lunghezza_c | 2.98 | 1.28 | 2.33 | 0.020 |
|
#Print of AIC and BIC test
aicbic_tables <- aic_bic_fun(mod3, mod4)
aicbic_tables$AIC
| df | AIC | |
|---|---|---|
| mod_prev | 9 | 35154.46 |
| mod_new | 9 | 35150.21 |
aicbic_tables$BIC
| df | BIC | |
|---|---|---|
| mod_prev | 9 | 35206.87 |
| mod_new | 9 | 35202.62 |
#Print of VIF table
vif_fun(mod4)
| Variable | VIF | |
|---|---|---|
| Gestazione | Gestazione | 1.68 |
| Cranio | Cranio | 1.62 |
| N.gravidanze | N.gravidanze | 1.03 |
| Fumatrici | Fumatrici | 1.02 |
| Lunghezza_c | Lunghezza_c | 2.12 |
| Sesso | Sesso | 1.04 |
| Fumatrici:Lunghezza_c | Fumatrici:Lunghezza_c | 1.05 |
A nonlinear effect of gestational age is introduced to explain the observed pattern of saturation in birth weight.
A slight improvement in AIC is observed, suggesting greater accuracy, while BIC, which further penalizes model complexity, deteriorates marginally. The nonlinear interaction is significant and clinically relevant, and therefore it is decided to retain it. The VIF values do not indicate multicollinearity among the included variables.
#Model with non linear term for Gestazione
mod5 <- lm(Peso ~ Gestazione_c + I(Gestazione_c^2) + Cranio + N.gravidanze + Fumatrici*Lunghezza_c + Sesso, data = data)
#Print of regression model coefficients of the fifth model
coef_fun(mod5)
| Estimate | Std. Error | t value | p_value | Significance | |
|---|---|---|---|---|---|
| (Intercept) | -308.18 | 145.85 | -2.11 | 0.035 |
|
| Gestazione_c | 37.26 | 4.30 | 8.66 | 0.000 | *** |
| I(Gestazione_c^2) | 1.41 | 0.66 | 2.13 | 0.033 |
|
| Cranio | 10.63 | 0.43 | 24.84 | 0.000 | *** |
| N.gravidanze | 13.04 | 4.34 | 3.00 | 0.003 | ** |
| FumatriciFumatrici | -22.41 | 27.75 | -0.81 | 0.419 | |
| Lunghezza_c | 10.24 | 0.31 | 33.36 | 0.000 | *** |
| SessoF | -74.93 | 11.25 | -6.66 | 0.000 | *** |
| FumatriciFumatrici:Lunghezza_c | 2.81 | 1.28 | 2.20 | 0.028 |
|
#Print of AIC and BIC test
aicbic_tables <- aic_bic_fun(mod4, mod5)
aicbic_tables$AIC
| df | AIC | |
|---|---|---|
| mod_prev | 9 | 35150.21 |
| mod_new | 10 | 35147.65 |
aicbic_tables$BIC
| df | BIC | |
|---|---|---|
| mod_prev | 9 | 35202.62 |
| mod_new | 10 | 35205.89 |
#Print of VIF table
vif_fun(mod5)
| Variable | VIF | |
|---|---|---|
| Gestazione_c | Gestazione_c | 2.15 |
| I(Gestazione_c^2) | I(Gestazione_c^2) | 1.83 |
| Cranio | Cranio | 1.64 |
| N.gravidanze | N.gravidanze | 1.03 |
| Fumatrici | Fumatrici | 1.02 |
| Lunghezza_c | Lunghezza_c | 2.17 |
| Sesso | Sesso | 1.05 |
| Fumatrici:Lunghezza_c | Fumatrici:Lunghezza_c | 1.05 |
The model coefficients displayed are all statistically significant and show that:
#Calculation of R^2 and RMSE of the best model
R_squared <- summary(mod5)$r.squared
R_squared <- round(R_squared, 2)
#Calculation of RMSE of the best model
predicted <- predict(mod5)
actual <- data$Peso
RMSE <- sqrt(mean((actual - predicted)^2))
RMSE <- round(RMSE, 2)
#Table of R^2 and RMSE of the best model
df_res <- data.frame(R_squared, RMSE)
knitr::kable(df_res, caption = "Table of R^2 and RMSE of the best model")
| R_squared | RMSE |
|---|---|
| 0.73 | 273.75 |
The model explains approximately 73% of the variability in birth weight and has a root mean square prediction error of about 274 grams, which is relatively small compared to the sample mean weight of approximately 3300 grams.
#Function for test assumption for model residuals
mod_diag <- function(mod) {
bp_test <- lmtest::bptest(mod)
dw_test <- lmtest::dwtest(mod)
shapiro_test <- shapiro.test(residuals(mod))
#Diagnostic test table
diag_tab <- data.frame(
Test = c("Breusch-Pagan (Heteroscedasticity)",
"Durbin-Watson (Autocorrelation)",
"Shapiro-Wilk (Normality of Residuals)"),
Statistic = c(round(bp_test$statistic, 2),
round(dw_test$statistic, 2),
round(shapiro_test$statistic, 2)),
p_value = c(format.pval(bp_test$p.value, digits = 3, eps = 1e-3),
format.pval(dw_test$p.value, digits = 3, eps = 1e-3),
format.pval(shapiro_test$p.value, digits = 3, eps = 1e-3)),
stringsAsFactors = FALSE
)
kable_diag <- kable(diag_tab, caption = "Diagnostic Tests Summary")%>%
kableExtra::kable_styling(full_width = FALSE, position = "left")
return(kable_diag)
}
#Print of diagnosis results
mod_diag(mod5)
| Test | Statistic | p_value | |
|---|---|---|---|
| BP | Breusch-Pagan (Heteroscedasticity) | 103.55 | <0.001 |
| DW | Durbin-Watson (Autocorrelation) | 1.95 | 0.123 |
| W | Shapiro-Wilk (Normality of Residuals) | 0.97 | <0.001 |
Diagnostic tests reveal significant heteroskedasticity, no residual autocorrelation, and residual deviation from normality, as determined by the Breusch-Pagan, Durbin-Watson, and Shapiro-Wilk tests, respectively. However, the large sample size likely reduces the impact of non-normality on inference.
#Plot of residuals vs fitted
df_resid <- data.frame(
Fitted = fitted(mod5),
Residuals = resid(mod5),
StdResid = rstandard(mod5),
Index = 1:length(resid(mod5))
)
#Residuals vs fitted
ggplot(df_resid, aes(x = Fitted, y = Residuals)) +
geom_point(alpha=0.5) +
geom_hline(yintercept = 0, linetype="dashed") +
geom_smooth(method = "loess", se=FALSE, color="red") +
labs(title = "Residuals vs predicted values", x = "Predicted values", y = "Residuals") +
theme_classic()
The plot of studentized residuals versus predicted values shows increasing prediction variability with higher expected weights, confirming heteroscedasticity.
#Q-Q plot
ggplot(df_resid, aes(sample = StdResid)) +
stat_qq(color = "blue") + stat_qq_line(color = "red") +
labs(title = "Q-Q plot of model student residuals",
x = "Theoretical quantiles",
y = "Sample quantiles") +
theme_classic()
The Q-Q plot shows normal behavior for most values (especially the central ones) and highlights deviations from normality mainly in the tails, particularly for the higher weight values, supporting the observed heteroscedasticity.
To investigate residual heteroscedasticity, outliers, high leverage, and influential points are visualized in the plot showing the relationship between studentized residuals and predicted values of the target variable.
influential_fun <- function(df_resid, mod) {
n <- length(df_resid$Residuals)
p <- length(coefficients(mod))
data_diag <- data.frame(
Index = 1:n,
Leverage = hatvalues(mod),
RStudent = df_resid$StdResid,
CooksDistance = cooks.distance(mod)
)
#Cutoff for influential points
leverage_cutoff <- 2 * p / n
rstudent_cutoff <- 2
cook_cutoff <- 4 / (n - p)
#Flag influential points
data_diag <- data_diag %>%
mutate(
HighLeverage = Leverage > leverage_cutoff,
Outlier = abs(RStudent) > rstudent_cutoff,
Influential = CooksDistance > cook_cutoff
)
#Influential points table
influential_points <- data_diag %>%
filter(HighLeverage | Outlier | Influential) %>%
arrange(desc(CooksDistance)) %>%
dplyr::select(Index, Leverage, RStudent, CooksDistance, HighLeverage, Outlier, Influential)
#Return influential points
return(influential_points)
}
#Influential points
infl_return <- influential_fun(df_resid, mod5)
HighLeverage <- infl_return$HighLeverage
Outlier <- infl_return$Outlier
Influetial <- infl_return$Influential
data_with_flags <- data %>%
mutate(Index = row_number()) %>%
left_join(infl_return, by = "Index")
df_resid_full <- data.frame(
Index = 1:nrow(data_with_flags),
Fitted = fitted(mod5),
StdResid = rstandard(mod5)
) %>%
left_join(infl_return, by = "Index")
#Function for influential points
plot_single_type <- function(data, flag_var, color, shape = 16, title = "") {
ggplot(data, aes_string(x = "Fitted", y = "StdResid")) +
geom_point(alpha = 0.3, color = "grey") +
geom_point(data = filter(data, !!rlang::sym(flag_var)),
aes(x = Fitted, y = StdResid),
color = color, shape = shape, size = 3, alpha = 0.8) +
geom_smooth(method = "loess", se = FALSE) +
theme_classic() +
labs(title = title, x = "Predicted values", y = "Std Residuals")
}
p1 <- plot_single_type(df_resid_full, "HighLeverage", "red", 16, "High Leverage Points")
p2 <- plot_single_type(df_resid_full, "Outlier", "blue", 1, "Outlier Points")
p3 <- plot_single_type(df_resid_full, "Influential", "black", 4, "Influential Points")
#Plot of influential
combined_plot <- p1 | p2 | p3
print(combined_plot)
Influence points and outliers are mainly concentrated at higher predicted values and show biologically plausible values. This pattern appears to reflect the natural variability in infants with larger body sizes. Therefore, modeling approaches that account for non-constant variance, such as generalized linear models (GLMs) with appropriate link functions, such as the logarithmic link, could be considered.
This section tests the model output by simulating the prediction result for selected values of the regressors.
#Prediction with some example values
predict_weight <- function(N_gravidanze, Fumatrici, Gestazione, Lunghezza, Cranio, Sesso, model, mean_gest, mean_lung) {
#Control of factorial levels
if(!Fumatrici %in% levels(model$model$Fumatrici)) {
stop("Invalid value for Fumatrici")
}
if(!Sesso %in% unique(model$model$Sesso)) {
stop("Invalid value for Sesso")
}
#Centering of variable
Gestazione_c <- as.double(Gestazione - mean_gest)
Gestazione_c2 <- as.double(Gestazione_c^2)
Lunghezza_c <- as.double(Lunghezza - mean_lung)
#Creation of matrix 1x1 as the same class of variables in model
Gestazione_c <- matrix(Gestazione_c, nrow = 1, ncol = 1)
Gestazione_c2 <- matrix(Gestazione_c2, nrow = 1, ncol = 1)
Lunghezza_c <- matrix(Lunghezza_c, nrow = 1, ncol = 1)
#New data
newdata <- data.frame(
N.gravidanze = as.numeric(N_gravidanze),
Fumatrici = factor(Fumatrici, levels = levels(model$model$Fumatrici)),
Cranio = as.numeric(Cranio),
Sesso = factor(Sesso, levels = levels(model$model$Sesso))
)
#Add matrix for variables Gestazione_c, Gestazione_c2 and Lunghezza_c
newdata$Gestazione_c <- I(Gestazione_c)
newdata$Gestazione_c2 <- I(Gestazione_c2)
newdata$Lunghezza_c <- I(Lunghezza_c)
#Logaritmic prediction
w_pred <- predict(model, newdata = newdata)
#Return of transformed prediction
weight_pred <- w_pred
return(weight_pred)
}
#Mean calculation out of the function for centering variables
mean_gest <- mean(data$Gestazione, na.rm = TRUE)
mean_lung <- mean(data$Lunghezza, na.rm = TRUE)
predicted_weight <- predict_weight(
N_gravidanze = 3,
Fumatrici = "NonFumatrici",
Gestazione = 39,
Lunghezza = 490,
Cranio = 337,
Sesso = "F",
model = mod5,
mean_gest = mean_gest,
mean_lung = mean_lung
)
In this case, this is the result of prediction. Predicted birth weight: 3190 grams
To convey the most relevant model findings, the graphical representations focus on the clinically most influential variables for predicting birth weight and on the statistically significant interaction terms retained in the final model.
#Function to plot relationship with regressors and target variable
create_plots <- function(predict_func, model, mean_gest, mean_lung, fixed_vals_list) {
#Create levels for categorical variables
fumatrici_levels <- levels(model$model$Fumatrici)
sesso_levels <- levels(model$model$Sesso)
#Plot 1: Peso vs Gestazione and Fumatrici
gest_seq <- seq(25, 45, 0.5)
df1 <- expand.grid(Gestazione = gest_seq, Fumatrici = fumatrici_levels)
df1$Predicted <- mapply(function(g,f) {
predict_func(N_gravidanze = fixed_vals_list$p1$N_gravidanze,
Fumatrici = f,
Gestazione = g,
Lunghezza = fixed_vals_list$p1$Lunghezza,
Cranio = fixed_vals_list$p1$Cranio,
Sesso = fixed_vals_list$p1$Sesso,
model = model,
mean_gest = mean_gest,
mean_lung = mean_lung)
}, df1$Gestazione, df1$Fumatrici)
p1 <- ggplot(df1, aes(x=Gestazione, y=Predicted, color=Fumatrici)) +
geom_point() +
geom_smooth(method = "loess")+
labs(title="Pred. Weight vs Gestazione and Fumatrici") +
theme_classic()
#Plot 2: Peso vs Lunghezza and Fumatrici
lung_seq <- seq(300, 560, 0.5)
df2 <- expand.grid(Lunghezza = lung_seq, Fumatrici = fumatrici_levels)
df2$Predicted <- mapply(function(l,f) {
pred <- predict_func(N_gravidanze = fixed_vals_list$p2$N_gravidanze,
Fumatrici = f,
Gestazione = fixed_vals_list$p2$Gestazione,
Lunghezza = l,
Cranio = fixed_vals_list$p2$Cranio,
Sesso = fixed_vals_list$p2$Sesso,
model = model,
mean_gest = mean_gest,
mean_lung = mean_lung)
}, df2$Lunghezza, df2$Fumatrici)
p2 <- ggplot(df2, aes(x=Lunghezza, y=Predicted, color=Fumatrici)) +
geom_point() +
geom_smooth(method = "loess")+
labs(title="Pred.Weight vs Lunghezza and Fumatrici") +
theme_classic()
return(list(p1, p2))
}
#Creation of fixed vals list
fixed_vals_list <- list(
p1 = list(N_gravidanze=3, Lunghezza=490, Cranio=337, Sesso="F"),
p2 = list(N_gravidanze=3, Gestazione=39, Cranio=337, Sesso="F")
)
#Execution and visualization
plots <- create_plots(predict_weight, mod5, mean_gest, mean_lung, fixed_vals_list)
#Creation of wrap plots
wrap_plots(plots, ncol=2, nrow=1)
A nonlinear relationship between expected weight and gestational age is confirmed, with greater gain in the final weeks and lower weight for newborns of mothers who smoke. Weight and length are positively correlated, with differences observed between smoking groups.
These findings support the biological importance of gestation and anthropometric measurements and the impact of smoking on neonatal growth.
The final model provides a reliable estimate of neonatal birth weight and highlights gestational age, maternal smoking, sex, and anthropometric measurements as determining factors. Predictive performance is generally satisfactory. Greater uncertainty is observed at extreme weight values, suggesting that model estimates should complement rather than replace clinical judgment.
Operationally, the model can support the identification of high-risk cases, such as preterm births, and resource planning to manage such cases. Future developments could include the integration of additional fetal growth data and additional maternal variables to improve personalization and predictive robustness.