Birth weight is one of the most important indicators of neonatal health, as it is strongly associated with several short- and long-term medical outcomes. Low birth weight can increase the risk of respiratory complications, metabolic disorders and developmental delays during the first stages of life.
The company Neonatal Health Solutions aims to support hospitals through data analysis and predictive modelling in order to better understand the factors that influence neonatal outcomes.
In this project we analyse a dataset containing information on 2500 newborns collected from three different hospitals. The goal is to identify the main factors affecting newborn birth weight and build a statistical model capable of predicting it using maternal characteristics, pregnancy information and neonatal measurements.
The analysis combines exploratory data analysis, hypothesis testing and multiple linear regression in order to provide interpretable insights that may help healthcare professionals improve prenatal monitoring and hospital resource planning.
#Define Coefficient of Variation function
CV <-function(x){
return(sd(x)/mean(x) * 100)
}
# Define variability function
variability <- function(x) {
rng <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
iqr <- IQR(x, na.rm = TRUE)
variance <- round(var(x, na.rm = TRUE), 2)
stdev <- round(sd(x, na.rm = TRUE), 2)
coeff_var <- round(CV(x), 2)
return(data.frame(
range = rng,
IQR = iqr,
variance = variance,
standard_deviation = stdev,
coefficient_of_variation = coeff_var
))
}
#Define shape function
shape <- function(x) {
#Calculate skewness e kurtosis
skew <- round(skewness(x, na.rm = TRUE), 2)
kurt <- round(kurtosis(x, na.rm = TRUE) - 3, 2)
#Return values
return(data.frame(
skewness = skew,
kurtosis = kurt
))
}
#Model fit statistics
model_fit_table <- function(model, caption = "Model fit statistics") {
fit <- data.frame(
R_squared = summary(model)$r.squared,
Adj_R_squared = summary(model)$adj.r.squared,
Residual_SE = summary(model)$sigma,
F_statistic = summary(model)$fstatistic[1],
AIC = AIC(model),
BIC = BIC(model)
)
fit <- round(fit, 3)
knitr::kable(
fit,
caption = caption
)
}
newborn <- read.csv("neonati.csv", stringsAsFactors= T, sep=",")
attach(newborn)
nrow(newborn)
## [1] 2500
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso |
|---|---|---|---|---|---|---|---|---|---|
| 26 | 0 | 0 | 42 | 3380 | 490 | 325 | Nat | osp3 | M |
| 21 | 2 | 0 | 39 | 3150 | 490 | 345 | Nat | osp1 | F |
| 34 | 3 | 0 | 38 | 3640 | 500 | 375 | Nat | osp2 | M |
| 28 | 1 | 0 | 41 | 3690 | 515 | 365 | Nat | osp2 | M |
| 20 | 0 | 0 | 38 | 3700 | 480 | 335 | Nat | osp3 | F |
The dataset neonati includes ten variables collected from three different hospitals, encompassing both qualitative and quantitative data, with a total of 2,500 observations. A clear understanding of the nature and type of each variable is essential for selecting appropriate statistical techniques and building a reliable predictive model for birth weight, which is the primary target variable in this study.
Below is a description and classification of each variable:
Maternal Smoking (Fumatrici): A binary qualitative nominal variable
indicating whether the mother is a smoker (1) or non-smoker (0).knitr::kable(summary(newborn))
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.0000 | Min. :0.0000 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Ces: 728 | osp1:816 | F:1256 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | 1st Qu.:0.0000 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Nat:1772 | osp2:849 | M:1244 | |
| Median :28.00 | Median : 1.0000 | Median :0.0000 | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | NA | osp3:835 | NA | |
| Mean :28.16 | Mean : 0.9812 | Mean :0.0416 | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | Max. :1.0000 | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
Here below are some considerations regarding the initial analysis of the variables:
At first glance, the minimum value of the mother’s age variable warrants further investigation. A recorded minimum of 0 years is clearly invalid, as no biological or clinical scenario justifies such a value. This anomaly is most likely the result of a data entry error or a missing value incorrectly encoded as 0. To visually inspect potential outliers a scatter plot will be generated.
#Create the scatter plot
ggplot(newborn, aes(x = 1:nrow(newborn), y = Anni.madre)) +
# Conditionally color points red if they are outliers
geom_point(aes(color = Anni.madre < (quantile(Anni.madre, 0.25) - 1.5 *
IQR(Anni.madre))),
size = 1) +
# Label only the outlier points
geom_text(aes(label = ifelse(Anni.madre < (quantile(Anni.madre, 0.25) - 1.5 *
IQR(Anni.madre)),
Anni.madre, "")),
vjust = 1.5, show.legend = FALSE, color = "red", size = 3) +
# Manual color mapping: TRUE = red, FALSE = black
scale_color_manual(values = c("black", "red")) +
# Styling
labs(title = "Mother's Age Values",
x = "Observation ID",
y = "Mother's Age") +
theme_light() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
The scatter plot highlights the presence of five outliers in the variable representing mother’s age. Among them, three values (13 and 14 years old) can be considered extremely rare but biologically plausible, as early teenage pregnancies, although uncommon, are possible. However, the values 0 and 1 are clearly implausible and most likely result from data entry or recording errors.
newborn_clean <- newborn %>%
filter(!(Anni.madre %in% c(0, 1)))
attach(newborn_clean)
## The following objects are masked from newborn:
##
## Anni.madre, Cranio, Fumatrici, Gestazione, Lunghezza, N.gravidanze,
## Ospedale, Peso, Sesso, Tipo.parto
knitr::kable(summary(newborn_clean))
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :13.00 | Min. : 0.0000 | Min. :0.00000 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Ces: 728 | osp1:816 | F:1255 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | 1st Qu.:0.00000 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Nat:1770 | osp2:848 | M:1243 | |
| Median :28.00 | Median : 1.0000 | Median :0.00000 | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | NA | osp3:834 | NA | |
| Mean :28.19 | Mean : 0.9816 | Mean :0.04163 | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | Max. :1.00000 | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
Reviewing the summary of the cleaned dataset, it is evident that all variables, except for Mother’s Age, maintain trends consistent with those seen in the original data. The cleaning process has not significantly altered the distributions of variables such as gestation, weight, or newborn sex.
Regarding the Mother’s Age variable, the updated summary shows a minimum value of 13 and a maximum of 46, both of which are biologically plausible. The median remains at 28, and the mean is now 28.19, closely aligned with the median. This suggests a symmetrical and approximately normal distribution of maternal age in the cleaned data. The removal of invalid entries (such as ages 0 and 1) has improved the variable’s reliability and accuracy for statistical modeling.
### Descriptive Statistics
To further confirm the findings highlighted by the measures of position, an analysis of variability is essential. This allows us to better understand the variables.
#Apply the variability() function to each quantitative variable
variability_table <- do.call(rbind, lapply(newborn_clean[, quant_var], variability))
#Display the result as a nicely formatted table
knitr::kable(variability_table, digits = 2)
| range | IQR | variance | standard_deviation | coefficient_of_variation | |
|---|---|---|---|---|---|
| Anni.madre | 33 | 7 | 27.22 | 5.22 | 18.51 |
| N.gravidanze | 12 | 1 | 1.64 | 1.28 | 130.50 |
| Peso | 4100 | 630 | 275865.90 | 525.23 | 15.99 |
| Gestazione | 18 | 2 | 3.49 | 1.87 | 4.79 |
| Lunghezza | 255 | 30 | 693.21 | 26.33 | 5.32 |
| Cranio | 155 | 20 | 269.93 | 16.43 | 4.83 |
To enhance the finding a graphical representation and an in-deep analysis of skewness and kurtosis and the boxplot is essential.
#Apply the shape() function to each quantitative variable
shape_table <- do.call(rbind, lapply(newborn_clean[, quant_var], shape))
#Display the result as a nicely formatted table
knitr::kable(shape_table, digits = 2)
| skewness | kurtosis | |
|---|---|---|
| Anni.madre | 0.15 | -0.11 |
| N.gravidanze | 2.51 | 10.98 |
| Peso | -0.65 | 2.03 |
| Gestazione | -2.07 | 8.26 |
| Lunghezza | -1.51 | 6.48 |
| Cranio | -0.79 | 2.94 |
#Reshape the dataset to long format
long_data <- newborn_clean %>%
pivot_longer(cols = all_of(quant_var), names_to = "Variable", values_to = "Value")
#Compute median and upper outlier threshold for each variable
lines_df <- long_data %>%
group_by(Variable) %>%
summarise(
Q2 = quantile(Value, 0.5, na.rm = TRUE),
LowerOut = quantile(Value, 0.25, na.rm = TRUE) - 1.5 * IQR(Value, na.rm =
TRUE),
UpperOut = quantile(Value, 0.75, na.rm = TRUE) + 1.5 * IQR(Value, na.rm = TRUE)
)
#Create the boxplots
ggplot(long_data, aes(x = "", y = Value)) +
geom_boxplot(fill = "lightblue", outlier.shape = 16, outlier.color = "red") +
geom_hline(data = lines_df, aes(yintercept = Q2), color = "red") +
geom_hline(data = lines_df, aes(yintercept = UpperOut), color = "black") +
geom_hline(data = lines_df, aes(yintercept = LowerOut), color = "grey40") +
facet_wrap(~ Variable, scales = "free_y") +
labs(title = "Boxplots of Quantitative Variables with Outliers and Thresholds",
x = NULL, y = NULL) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
Overall, the boxplots visually confirm the distributional properties identified through skewness and kurtosis.
Variables such as number of pregnancies and gestational duration show pronounced asymmetry and a high concentration of extreme values,which is reflected in the presence of numerous outliers and elongated whiskers.
Similarly, the heavier lower tails observed in the boxplots for newborn weight, length, and Head Circumference are consistent with their negative skewness and elevated kurtosis values.
In contrast, mother’s age displays a more balanced boxplot, in line with skewness and kurtosis values close to those of a normal distribution.
These visual patterns reinforce the quantitative evidence and support the conclusions drawn from the measures of distribution shape.
Before focusing on the inferential analysis and the formal testing of the research hypotheses, it is useful to conclude the descriptive phase by graphically representing the main variables.Visual exploration provides a more intuitive and comprehensive overview of the data,allowing patterns, differences, and potential anomalies to be identified more easily than through numerical summaries alone. These graphical representations complement the descriptive statistics and help contextualize the subsequent hypothesis tests.
# Prepare the variable
qual_df <- newborn_clean %>%
mutate(
Fumatrici = factor(Fumatrici, levels = c(0, 1), labels = c("No smoking", "Smoking")),
Tipo.parto = factor(Tipo.parto, levels = c("Nat", "Ces"), labels = c("Vaginal", "Cesarean")),
Ospedale = factor(Ospedale),
Sesso = factor(Sesso, levels = c("F", "M"), labels = c("Female", "Male"))
)
freq_cat <- qual_df %>%
select(all_of(qual_var)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
count(Variable, Category, name = "n") %>%
group_by(Variable) %>%
mutate(p = n / sum(n)) %>%
ungroup() %>%
mutate(
Variable = dplyr::recode(
Variable,
Fumatrici = "Maternal smoking",
Tipo.parto = "Delivery type",
Ospedale = "Birth hospital",
Sesso = "Newborn sex"
),
label = paste0(n, " (", percent(p, accuracy = 0.1), ")")
)
# Create the graph
p_cat <- ggplot(freq_cat, aes(x = n, y = Category, fill = Category)) +
geom_col(width = 0.55, show.legend = FALSE) +
geom_text(aes(label = label), hjust = -0.08, size = 3.6) +
facet_wrap(~ Variable, scales = "free_y", ncol = 2) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
scale_fill_brewer(palette = "Set3") +
labs(
title = "Qualitative variables — frequency distributions",
x = "Number of observations",
y = NULL
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10),
strip.text = element_text(face = "bold"),
panel.grid.minor = element_blank()
)
p_cat
# Function to build frequency tables
freq_table <- function(data, var) {
data %>%
count({{ var }}, name = "Count") %>%
mutate(
Percentage = Count / sum(Count),
Percentage = percent(Percentage, accuracy = 0.1)
)
}
tab_hospital <- freq_table(qual_df, Ospedale)
tab_delivery <- freq_table(qual_df, Tipo.parto)
tab_smoking <- freq_table(qual_df, Fumatrici)
tab_sex <- freq_table(qual_df, Sesso)
kable(tab_hospital)
| Ospedale | Count | Percentage |
|---|---|---|
| osp1 | 816 | 32.7% |
| osp2 | 848 | 33.9% |
| osp3 | 834 | 33.4% |
kable(tab_delivery)
| Tipo.parto | Count | Percentage |
|---|---|---|
| Vaginal | 1770 | 70.9% |
| Cesarean | 728 | 29.1% |
kable(tab_smoking)
| Fumatrici | Count | Percentage |
|---|---|---|
| No smoking | 2394 | 95.8% |
| Smoking | 104 | 4.2% |
kable(tab_sex)
| Sesso | Count | Percentage |
|---|---|---|
| Female | 1255 | 50.2% |
| Male | 1243 | 49.8% |
Overall, the qualitative variables show a clear and interpretable structure: the sample is well balanced across the three hospitals (each contributing roughly one third of the observations), delivery mode is predominantly vaginal with a substantial minority of cesarean sections, and newborn sex is almost perfectly balanced. Maternal smoking is relatively rare in this cohort.
pretty_labels <- c(
"Mother's age (years)",
"Number of pregnancies",
"Gestation (weeks)",
"Birthweight (g)",
"Length (cm)",
"Head circumference"
)
num_long <- newborn_clean %>%
select(all_of(quant_var)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
mutate(
Variable = factor(Variable, levels = quant_var, labels = pretty_labels)
)
# Only Mother’s age qualifies based on your skewness/kurtosis
normal_vars <- c("Mother's age (years)")
curve_df <- num_long %>%
filter(Variable %in% normal_vars) %>%
group_by(Variable) %>%
summarise(
mu = mean(Value, na.rm = TRUE),
sigma = sd(Value, na.rm = TRUE),
xmin = quantile(Value, 0.01, na.rm = TRUE),
xmax = quantile(Value, 0.99, na.rm = TRUE),
.groups = "drop"
) %>%
rowwise() %>%
mutate(x = list(seq(xmin, xmax, length.out = 200))) %>%
unnest(cols = c(x)) %>%
mutate(y = dnorm(x, mean = mu, sd = sigma)) %>%
ungroup()
p_num_scientific <- ggplot(num_long, aes(x = Value)) +
geom_histogram(aes(y = after_stat(density)),
bins = 35, alpha = 0.85) +
geom_line(
data = curve_df,
aes(x = x, y = y),
linewidth = 1
) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
labs(
title = "Numeric variables — distributions (post-cleaning)",
subtitle = "Normal curve added only where skewness and kurtosis indicate approximate normality",
x = NULL,
y = "Density"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10),
strip.text = element_text(face = "bold"),
panel.grid.minor = element_blank()
)
p_num_scientific
Normal density curves were superimposed only for variables that satisfied approximate normality conditions, defined by |skewness| < 0.5 and kurtosis close to 0. Based on these criteria, only maternal age exhibited a distribution sufficiently close to normal, while all other variables showed significant asymmetry and/or heavy tails. Importantly, the visual patterns observed in the histograms are fully consistent with the previously computed numerical shape measures (skewness and kurtosis) as well as with the information provided by the boxplots. In particular, variables characterized by strong skewness and heavy tails in the numerical analysis also display asymmetric shapes and numerous outliers in the boxplots, whereas variables closer to normality show more symmetric and compact distributions. This coherence between graphical and numerical summaries reinforces the robustness of the descriptive analysis and supports the interpretation of the underlying distributional properties of the data.
Having completed this descriptive phase, we now proceed to the inferential analysis, where the following research hypotheses will be formally tested using appropriate statistical methods: (i) whether the proportion of cesarean deliveries differs across hospitals, (ii) whether the mean birthweight and mean length of the newborns in this sample are significantly equal to the corresponding population values, and (iii) whether anthropometric measurements differ significantly between male and female newborns.
Cesarean delivery is performed more often in some hospital
To investigate whether the frequency of cesarean deliveries differs across hospitals, considering also that both variables are categorical, a Chi-square test of independence will be applied to the contingency table formed by hospital and delivery type. This test assesses whether the observed distribution of vaginal and cesarean births deviates from what would be expected under the assumption of independence. A significant result would indicate that the likelihood of undergoing a cesarean section varies by hospital, suggesting differences in clinical practices or patient populations.
This can be formulated as:# Contingency table: Hospital × Delivery type
tab_delivery_hospital <- table(qual_df$Ospedale, qual_df$Tipo.parto)
tab_delivery_hospital_df <- as.data.frame(tab_delivery_hospital) %>%
rename(Hospital = Var1, DeliveryType = Var2, n = Freq) %>%
group_by(Hospital) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
kable(tab_delivery_hospital)
| Vaginal | Cesarean | |
|---|---|---|
| osp1 | 574 | 242 |
| osp2 | 594 | 254 |
| osp3 | 602 | 232 |
# Chi-square test
chi <- chisq.test(tab_delivery_hospital)
knitr::kable(tidy(chi))
| statistic | p.value | parameter | method |
|---|---|---|---|
| 1.082984 | 0.5818793 | 2 | Pearson’s Chi-squared test |
# Check assumptions: expected frequencies
chi_expected <- chisq.test(tab_delivery_hospital)$expected
knitr::kable(round(chi_expected, 1))
| Vaginal | Cesarean | |
|---|---|---|
| osp1 | 578.2 | 237.8 |
| osp2 | 600.9 | 247.1 |
| osp3 | 590.9 | 243.1 |
# Grahpical rapresentation
p1 <- ggplot(tab_delivery_hospital_df, aes(x = Hospital, y = prop, fill = DeliveryType)) +
geom_col(position = "fill", width = 0.7) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(
title = "Delivery type distribution by hospital",
x = "Hospital",
y = "Proportion",
fill = "Delivery type"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
panel.grid.minor = element_blank()
)
p1
A Pearson Chi-square test of independence was used to assess whether the proportion of cesarean deliveries differs across hospitals.
The test was not statistically significant (χ2 = 1.083, df = 2, p = 0.5819), indicating no evidence that delivery type is associated with the hospital in this sample.
The expected frequencies under the null hypothesis were all large (minimum expected count ≈ 237.8), satisfying the assumptions required for the Chi-square approximation.
Therefore, based on these data we conclude that the observed differences in cesarean rates across the three hospitals are compatible with random variation rather than systematic differences in clinical practice.
The average anthropometric characteristics observed in this sample of newborns—specifically birthweight and length—are representative of those observed in the general population
The second hypothesis assesses whether the average anthropometric characteristics observed in this sample of newborns—specifically Newborn weight and length—are statistically consistent with the corresponding population reference values. This step is important to evaluate the external validity of the dataset, i.e., whether the sample can be considered representative of the broader population and therefore suitable for generalizable clinical interpretation and predictive modeling. Since the comparison is between a sample mean to a known population value, the appropriate test is a one-sample t-test.
Before performing the test, we will check:
mu_weight <- 3300
mu_length <- 500
# One-sample t-test: Birthweight
t_weight <- t.test(newborn_clean$Peso, mu = mu_weight)
# One-sample t-test: Length
t_length <- t.test(newborn_clean$Lunghezza, mu = mu_length)
# Create a summary table
results <- data.frame(
Variable = c("Birthweight", "Length"),
Mean = c(mean(newborn_clean$Peso),
mean(newborn_clean$Lunghezza)),
t_value = c(t_weight$statistic, t_length$statistic),
df = c(t_weight$parameter, t_length$parameter),
p_value = c(t_weight$p.value, t_length$p.value)
)
# Round ONLY numeric columns
results[, -1] <- round(results[, -1], 3)
knitr::kable(results,
caption = "t-tests comparing sample means to population reference values")
| Variable | Mean | t_value | df | p_value |
|---|---|---|---|---|
| Birthweight | 3284.184 | -1.505 | 2497 | 0.132 |
| Length | 494.696 | -10.069 | 2497 | 0.000 |
A one-sample t-test was conducted to assess whether the mean birthweight and length observed in this sample of newborns differ from their respective population reference values.
For birthweight, the sample mean (3284.2 g) is slightly lower than the reference value of 3300 g; however, this difference is not statistically significant (t = −1.505, p = 0.132). Therefore, there is no evidence to reject the null hypothesis, and the sample mean birthweight can be considered statistically consistent with the population reference value.
In contrast, the results for length show a statistically significant difference from the population reference value (t = −10.069, p = 0). The observed sample mean (494.7 units) is substantially lower than the reference value, indicating that newborn length in this sample differs systematically from the expected population average. Given the large sample size, this result is both statistically robust and unlikely to be due to random variation alone.
Overall, while birthweight appears representative of population standards, newborn length does not. This suggests that, although the dataset is broadly representative in terms of weight, caution is warranted when interpreting or generalizing results related to length, as the sample may reflect structural differences (e.g., measurement scale, clinical practices, or population characteristics) relative to the chosen reference values.
Anthropometric measurements differ significantly between the two sexes.
The third hypothesis investigates whether anthropometric measurements differ significantly between male and female newborns. From a statistical perspective, this hypothesis involves comparing the distribution of continuous anthropometric variables—such as birthweight, length, and head circumference—between two independent groups defined by newborn sex. The analysis will therefore focus on evaluating whether the mean values of these measurements differ between males and females.
Depending on the distributional properties of the variables and the robustness of the results observed during the descriptive analysis, appropriate two-sample comparison tests will be employed. In particular, independent-samples t-tests will be used when assumptions of approximate normality and homoscedasticity are reasonably satisfied.
# Data preparation
sex_df <- newborn_clean %>%
mutate(Sesso_raw = trimws(toupper(as.character(Sesso)))) %>%
mutate(
Sesso = case_when(
Sesso_raw %in% c("F", "FEMALE", "FEMMINA") ~ "Female",
Sesso_raw %in% c("M", "MALE", "MASCHIO") ~ "Male",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(Sesso)) %>%
mutate(Sesso = factor(Sesso, levels = c("Female", "Male")))
# Variables to compare
anthro_vars <- c("Peso", "Lunghezza", "Cranio")
# Boxplot
sex_long <- sex_df %>%
select(Sesso, all_of(anthro_vars)) %>%
pivot_longer(cols = all_of(anthro_vars),
names_to = "Variable",
values_to = "Value") %>%
mutate(
Variable = dplyr::recode(
Variable,
Peso = "Birthweight",
Lunghezza = "Length",
Cranio = "Head circumference"
)
)
p_box <- ggplot(sex_long, aes(x = Sesso, y = Value, fill = Sesso)) +
geom_boxplot(width = 0.55, outlier.alpha = 0.25) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
labs(
title = "Anthropometric measures by sex",
x = NULL,
y = NULL
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
strip.text = element_text(face = "bold"),
legend.position = "none",
panel.grid.minor = element_blank()
)
p_box
# t-test
t_peso <- t.test(Peso ~ Sesso, data = sex_df)
t_lung <- t.test(Lunghezza ~ Sesso, data = sex_df)
t_cran <- t.test(Cranio ~ Sesso, data = sex_df)
results <- data.frame(
Variable = c("Birthweight", "Length", "Head circumference"),
Female_mean = c(
mean(sex_df$Peso[sex_df$Sesso == "Female"], na.rm = TRUE),
mean(sex_df$Lunghezza[sex_df$Sesso == "Female"], na.rm = TRUE),
mean(sex_df$Cranio[sex_df$Sesso == "Female"], na.rm = TRUE)
),
Male_mean = c(
mean(sex_df$Peso[sex_df$Sesso == "Male"], na.rm = TRUE),
mean(sex_df$Lunghezza[sex_df$Sesso == "Male"], na.rm = TRUE),
mean(sex_df$Cranio[sex_df$Sesso == "Male"], na.rm = TRUE)
),
t_value = c(unname(t_peso$statistic), unname(t_lung$statistic), unname(t_cran$statistic)),
df = c(unname(t_peso$parameter), unname(t_lung$parameter), unname(t_cran$parameter)),
p_value = c(t_peso$p.value, t_lung$p.value, t_cran$p.value)
)
# Round ONLY numeric columns
results[, -1] <- round(results[, -1], 3)
knitr::kable(
results,
caption = "Two-sample t-tests comparing anthropometric measures between female and male newborns"
)
| Variable | Female_mean | Male_mean | t_value | df | p_value |
|---|---|---|---|---|---|
| Birthweight | 3161.061 | 3408.496 | -12.115 | 2488.674 | 0 |
| Length | 489.764 | 499.675 | -9.582 | 2457.301 | 0 |
| Head circumference | 337.623 | 342.459 | -7.437 | 2489.389 | 0 |
The results of the two-sample t-tests indicate that all three anthropometric measures —birthweight, length, and head circumference— differ significantly between female and male newborns. In each case, the p-value is effectively zero, providing very strong statistical evidence against the null hypothesis of equal means between sexes.
Specifically, male newborns exhibit higher average values for all considered measures. The mean birthweight of male newborns (≈ 3408 g) is substantially higher than that of females (≈ 3161 g). Similarly, males show larger average head circumference (≈ 342 mm vs ≈ 338 mm) and greater average length (≈ 499 mm vs ≈ 490 mm). The negative t-values observed in all tests reflect the fact that the mean for females is consistently lower than that for males.
The accompanying boxplots visually reinforce these findings. For each anthropometric variable, the distribution for male newborns is clearly shifted upward relative to that for females, with higher medians and upper quartiles. While there is some overlap between the distributions—expected in biological data—the overall separation between sexes is evident and consistent across all three measures.
Importantly, the graphical evidence is fully coherent with the numerical results of the t-tests. The systematic differences in central tendency observed in the boxplots align with the statistically significant test outcomes, supporting the conclusion that sex is a meaningful determinant of neonatal anthropometric characteristics in this dataset. As a result, newborn sex should be considered a relevant explanatory variable in subsequent analyses and predictive modeling.
The next step of the analysis focuses on the development of a multiple linear regression model aimed at explaining and predicting newborn birthweight. Birthweight is treated as a continuous response variable and represents a key indicator of neonatal health, making it a suitable outcome for a regression-based modeling approach.
By fitting a multiple linear regression model, it will be possible to quantify the marginal effect of each predictor on birthweight while holding all other variables constant. This approach allows us to disentangle the individual contribution of correlated factors and to identify the most influential determinants of neonatal weight.
As a first step in the regression workflow, a correlation matrix is computed for the quantitative variables included in the analysis. This provides an initial overview of the linear relationships between predictors and the response variable, and it is particularly useful to identify clusters of strongly correlated predictors (e.g., anthropometric measures) that may lead to multicollinearity in a multiple regression model. In practice, high correlations suggest that some predictors may carry overlapping information, potentially inflating standard errors and reducing the stability and interpretability of coefficient estimates. For this reason, the correlation structure is used to guide subsequent modeling choices, such as prioritizing clinically meaningful variables, evaluating multicollinearity, and motivating potential model refinements during the selection phase.
#select numerica variables for correlation
corr_df <- newborn_clean %>%
select(all_of(quant_var)) %>%
mutate(across(everything(), as.numeric))
# Correlation matrix
corr_mat <- cor(corr_df, use = "complete.obs", method = "pearson")
# Correlation matrix as a report table
kable(round(corr_mat, 3), caption = "Correlation matrix (Pearson) for quantitative variables")
| Anni.madre | N.gravidanze | Peso | Gestazione | Lunghezza | Cranio | |
|---|---|---|---|---|---|---|
| Anni.madre | 1.000 | 0.383 | -0.024 | -0.135 | -0.065 | 0.016 |
| N.gravidanze | 0.383 | 1.000 | 0.002 | -0.102 | -0.060 | 0.039 |
| Peso | -0.024 | 0.002 | 1.000 | 0.592 | 0.796 | 0.705 |
| Gestazione | -0.135 | -0.102 | 0.592 | 1.000 | 0.619 | 0.461 |
| Lunghezza | -0.065 | -0.060 | 0.796 | 0.619 | 1.000 | 0.603 |
| Cranio | 0.016 | 0.039 | 0.705 | 0.461 | 0.603 | 1.000 |
# Custom correlation panel
panel.cor <- function(x, y, digits = 2, prefix = "", ...) {
par(usr = c(0, 1, 0, 1))
r <-cor(x, y)
txt <- formatC(r, digits = digits, format = "f")
# Base size + gentle scaling
cex_base <- 0.9
cex_scale <- 1.2
text(0.5, 0.5, txt, cex = cex_base + cex_scale * r)
}
# Select quantitative variable
corr_df <- newborn_clean %>%
select(all_of(quant_var)) %>%
mutate(across(everything(), as.numeric))
# Scatterplot matrix with correlations
pairs(
corr_df,
upper.panel = panel.cor,
lower.panel = panel.smooth,
main = "Scatterplot matrix and pairwise correlations (quantitative variables)"
)
The scatterplot matrix provides a comprehensive overview of the pairwise relationships among the quantitative variables included in the regression analysis. Several clear linear associations emerge, particularly between birthweight and the anthropometric measures available before birth. Birthweight shows a strong positive correlation with newborn length (r ≈ 0.80) and head circumference (r ≈ 0.70), indicating that larger fetal dimensions are closely associated with higher weight at birth.
A moderate to strong positive relationship is also observed between gestational duration and birthweight (r ≈ 0.59), which is consistent with clinical expectations: longer gestation allows for greater fetal growth. Gestational duration is additionally correlated with length (r ≈ 0.62) and, to a lesser extent, with head circumference (r ≈ 0.46), suggesting that gestation affects multiple dimensions of neonatal growth.
Maternal characteristics such as mother’s age and number of pregnancies exhibit relatively weak correlations with birthweight and the other anthropometric variables. This indicates that, while these factors may still play a role in a multivariable context, their direct linear association with neonatal size is limited when considered in isolation.
Overall, the correlation structure highlights the presence of strong interrelationships among anthropometric predictors, particularly length and head circumference, which may introduce multicollinearity in the regression model.
Based on the insights obtained from the descriptive analysis and the correlation structure of the quantitative variables, a multiple linear regression model is specified with newborn birthweight as the response variable. The initial formulation of the model includes all predictors that are clinically plausible or potentially associated with birthweight, allowing for a comprehensive assessment of their joint effects within a multivariable framework. This full specification serves as a reference model from which more parsimonious alternatives can be derived through subsequent model refinement and selection procedures.
mod1 <- lm(
Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso,
data = newborn_clean
)
kable(
round(summary(mod1)$coefficients, 3),
caption = "Full multiple linear regression model for newborn birthweight"
)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6735.796 | 141.479 | -47.610 | 0.000 |
| Anni.madre | 0.802 | 1.147 | 0.699 | 0.484 |
| N.gravidanze | 11.381 | 4.669 | 2.438 | 0.015 |
| Fumatrici | -30.274 | 27.549 | -1.099 | 0.272 |
| Gestazione | 32.577 | 3.821 | 8.526 | 0.000 |
| Lunghezza | 10.292 | 0.301 | 34.207 | 0.000 |
| Cranio | 10.472 | 0.426 | 24.567 | 0.000 |
| Tipo.partoNat | 29.634 | 12.091 | 2.451 | 0.014 |
| Ospedaleosp2 | -11.091 | 13.447 | -0.825 | 0.410 |
| Ospedaleosp3 | 28.250 | 13.505 | 2.092 | 0.037 |
| SessoM | 77.572 | 11.187 | 6.934 | 0.000 |
The results of the full multiple linear regression model highlight several statistically significant determinants of newborn birthweight, while others do not appear to contribute meaningfully once all covariates are considered simultaneously. Among maternal characteristics, mother’s age does not show a significant association with birthweight, whereas the number of previous pregnancies is positively and significantly associated with birthweight, suggesting that multiparity is linked to slightly heavier newborns.
As expected, gestational duration emerges as a strong and highly significant predictor: each additional week of gestation is associated with an average increase in birthweight, holding all other variables constant. Similarly, the anthropometric measures newborn length and head circumference show very strong positive effects, confirming their close relationship with birthweight and their central role in explaining variability in neonatal size.
Regarding behavioral and contextual factors, maternal smoking exhibits a negative coefficient, indicating lower birthweight among newborns of smoking mothers; however, this effect is not statistically significant in the full model. The type of delivery shows a modest but statistically significant association, with vaginal deliveries associated with slightly higher birthweight compared to the reference category. This result likely reflects underlying pregnancy characteristics rather than a causal effect of delivery mode.
Hospital-related effects are limited: one hospital does not differ significantly from the reference hospital, while another shows a small but statistically significant difference in average birthweight. Finally, newborn sex remains a highly significant predictor even after adjustment for all other variables, with male newborns weighing on average more than female newborns. Overall, the model confirms that gestational age, anthropometric measures, parity, and sex are the primary drivers of birthweight in this dataset, while other factors play a secondary or negligible role once these key predictors are taken into account.
model_fit_table(
mod1,
caption = "Model fit statistics for the full regression model")
| R_squared | Adj_R_squared | Residual_SE | F_statistic | AIC | BIC | |
|---|---|---|---|---|---|---|
| value | 0.729 | 0.728 | 274.021 | 668.676 | 35145.57 | 35215.45 |
Starting from the full model, variable selection procedures based on information criteria are applied to identify a more parsimonious specification. Stepwise selection using the Akaike Information Criterion (AIC) balances goodness of fit and model complexity, while the Bayesian Information Criterion (BIC) provides a more conservative alternative that favors simpler models. This step aims to retain only those predictors that meaningfully contribute to explaining birthweight, improving interpretability without sacrificing explanatory power.
# Backward stepwise removing Mother's age
mod2 <- update(mod1,~.-Anni.madre)
# Coefficient table
kable(
round(summary(mod2)$coefficients, 3),
caption = "Regression model after removing Mother's age")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6708.619 | 136.021 | -49.320 | 0.000 |
| N.gravidanze | 12.583 | 4.340 | 2.899 | 0.004 |
| Fumatrici | -30.427 | 27.545 | -1.105 | 0.269 |
| Gestazione | 32.300 | 3.800 | 8.501 | 0.000 |
| Lunghezza | 10.292 | 0.301 | 34.209 | 0.000 |
| Cranio | 10.487 | 0.426 | 24.638 | 0.000 |
| Tipo.partoNat | 29.665 | 12.089 | 2.454 | 0.014 |
| Ospedaleosp2 | -10.951 | 13.444 | -0.815 | 0.415 |
| Ospedaleosp3 | 28.517 | 13.499 | 2.113 | 0.035 |
| SessoM | 77.645 | 11.185 | 6.942 | 0.000 |
# Model fit statistics
model_fit_table(
mod2,
caption = "Model fit statistics for the backward stepwise-selected: removing mother's age")
| R_squared | Adj_R_squared | Residual_SE | F_statistic | AIC | BIC | |
|---|---|---|---|---|---|---|
| value | 0.729 | 0.728 | 273.993 | 743.072 | 35144.06 | 35208.12 |
The regression model obtained after removing mother’s age (Anni.madre) shows results that are largely consistent with those of the full specification. The exclusion of maternal age does not materially alter the estimated coefficients of the remaining predictors, nor does it substantially change their statistical significance. This suggests that mother’s age did not provide meaningful additional explanatory power once the other covariates were included.
The key determinants of birthweight remain unchanged. Gestational duration, newborn length, and head circumference continue to exhibit strong and highly significant positive associations with birthweight, confirming their central role in explaining neonatal size. Newborn sex also remains highly significant, with male newborns weighing on average more than females, even after adjustment for anthropometric measures. The number of pregnancies retains a positive and statistically significant effect, while maternal smoking remains non-significant in this multivariable context.
The overall goodness-of-fit statistics remain virtually unchanged compared to the full model (R2 ≈ 0.729; Adjusted R2 ≈ 0.728). The residual standard error is also nearly identical, and the information criteria (AIC and BIC) do not indicate any substantial deterioration in model performance. This confirms that removing mother’s age does not compromise explanatory power and supports the decision to exclude it in pursuit of a more parsimonious specification.
# Backward stepwise removing Birth hospital
mod3 <- update(mod2,~.-Ospedale)
# Coefficient table
kable(
round(summary(mod3)$coefficients, 3),
caption = "Regression model after removing Birth Ospital")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6708.809 | 136.064 | -49.306 | 0.000 |
| N.gravidanze | 12.993 | 4.344 | 2.991 | 0.003 |
| Fumatrici | -31.882 | 27.580 | -1.156 | 0.248 |
| Gestazione | 32.597 | 3.804 | 8.569 | 0.000 |
| Lunghezza | 10.268 | 0.301 | 34.098 | 0.000 |
| Cranio | 10.501 | 0.426 | 24.637 | 0.000 |
| Tipo.partoNat | 30.424 | 12.104 | 2.514 | 0.012 |
| SessoM | 78.103 | 11.200 | 6.974 | 0.000 |
# Model fit statistics
model_fit_table(
mod3,
caption = "Model fit statistics for the backward stepwise-selected: removing Birth Ospital")
| R_squared | Adj_R_squared | Residual_SE | F_statistic | AIC | BIC | |
|---|---|---|---|---|---|---|
| value | 0.728 | 0.727 | 274.391 | 951.291 | 35149.32 | 35201.73 |
The regression model obtained after excluding the birth hospital variable shows results that remain highly consistent with the previous specification. The estimated coefficients for the core predictors—gestational duration, anthropometric measures, parity, type of delivery, and newborn sex—are virtually unchanged in magnitude and statistical significance. This stability indicates that hospital-related differences did not substantially confound the relationship between these variables and birthweight.
From a goodness-of-fit perspective, the exclusion of the hospital variable produces only a negligible reduction in explanatory power (R2 ≈ 0.728; Adjusted R2 ≈ 0.727). The residual standard error remains essentially unchanged, indicating that predictive accuracy is preserved. Importantly, the information criteria (AIC and BIC) do not show meaningful deterioration compared to the previous model, supporting the decision to remove the hospital variable in favor of a more parsimonious specification.
Overall, these results suggest that differences across hospitals do not materially affect newborn birthweight once individual clinical and biological factors are accounted for. The simplified model therefore achieves comparable explanatory performance while improving interpretability and parsimony.
# Backward stepwise removing Maternal smoking
mod4 <- update(mod3,~.-Fumatrici)
# Coefficient table
kable(
round(summary(mod4)$coefficients, 3),
caption = "Regression model after removing maternal smoking")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6708.017 | 136.072 | -49.298 | 0.000 |
| N.gravidanze | 12.736 | 4.339 | 2.935 | 0.003 |
| Gestazione | 32.325 | 3.797 | 8.514 | 0.000 |
| Lunghezza | 10.283 | 0.301 | 34.177 | 0.000 |
| Cranio | 10.506 | 0.426 | 24.648 | 0.000 |
| Tipo.partoNat | 30.160 | 12.103 | 2.492 | 0.013 |
| SessoM | 77.917 | 11.199 | 6.957 | 0.000 |
# Model fit statistics
model_fit_table(
mod4,
caption = "Model fit statistics for the backward stepwise-selected: removing maternal smoking")
| R_squared | Adj_R_squared | Residual_SE | F_statistic | AIC | BIC | |
|---|---|---|---|---|---|---|
| value | 0.728 | 0.727 | 274.41 | 1109.467 | 35148.67 | 35195.25 |
The regression model obtained after excluding maternal smoking remains highly consistent with the previous specifications, confirming that smoking status did not provide a statistically significant contribution to the explanation of birthweight in the multivariable context. The estimated coefficients of the remaining predictors are virtually unchanged in magnitude and significance, indicating that maternal smoking was not acting as a relevant confounder in this dataset.
From a model performance perspective, the exclusion of maternal smoking produces no meaningful loss in explanatory power. The R2 and adjusted R2 remain essentially unchanged (≈ 0.728 and 0.727, respectively), and the residual standard error is nearly identical to that observed in previous models. Moreover, the information criteria (AIC and BIC) show a slight improvement, supporting the decision to remove maternal smoking in favor of a more parsimonious model.
Overall, these findings suggest that maternal smoking does not independently influence birthweight once gestational duration and anthropometric measures are accounted for. The reduced model therefore maintains strong explanatory performance while improving simplicity and interpretability.
# Backward stepwise removing Type of Delivery
mod5 <- update(mod4,~.-Tipo.parto)
# Coefficient table
kable(
round(summary(mod5)$coefficients, 3),
caption = "Regression model after removing type of delivery")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6681.725 | 135.804 | -49.201 | 0.000 |
| N.gravidanze | 12.455 | 4.342 | 2.869 | 0.004 |
| Gestazione | 32.383 | 3.801 | 8.520 | 0.000 |
| Lunghezza | 10.245 | 0.301 | 34.059 | 0.000 |
| Cranio | 10.541 | 0.426 | 24.717 | 0.000 |
| SessoM | 77.981 | 11.211 | 6.956 | 0.000 |
# Model fit statistics
model_fit_table(
mod5,
caption = "Model fit statistics for the backward stepwise-selected: removing type of delivery")
| R_squared | Adj_R_squared | Residual_SE | F_statistic | AIC | BIC | |
|---|---|---|---|---|---|---|
| value | 0.727 | 0.726 | 274.697 | 1327.343 | 35152.89 | 35193.65 |
The regression model obtained after removing the type of delivery remains stable and retains strong explanatory power. The estimated effects of the remaining predictors—gestational duration, anthropometric measures, parity, and newborn sex—are virtually unchanged and remain highly statistically significant, indicating that the exclusion of delivery type does not materially affect the model structure.
From a goodness-of-fit perspective, the reduction in model performance is minimal. The adjusted R2 decreases only marginally (from approximately 0.727 to 0.726), and the residual standard error remains close to previous values, suggesting that predictive accuracy is preserved. Although information criteria (AIC and BIC) show a slight increase, this change is modest and reflects the trade-off between model simplicity and fit.
Overall, these results support the exclusion of delivery type in favor of a more parsimonious model, as this variable does not provide substantial additional explanatory power once key biological and gestational factors are accounted for.
# Backward stepwise removing Number of pregnancies
mod6 <- update(mod5,~.-N.gravidanze)
# Coefficient table
kable(
round(summary(mod6)$coefficients, 3),
caption = "Regression model after removing number of pregrancies")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6651.673 | 135.595 | -49.055 | 0 |
| Gestazione | 31.326 | 3.788 | 8.269 | 0 |
| Lunghezza | 10.202 | 0.301 | 33.909 | 0 |
| Cranio | 10.671 | 0.425 | 25.126 | 0 |
| SessoM | 79.103 | 11.220 | 7.050 | 0 |
# Model fit statistics
model_fit_table(
mod6,
caption = "Model fit statistics for the backward stepwise-selected: number of pregrancies")
| R_squared | Adj_R_squared | Residual_SE | F_statistic | AIC | BIC | |
|---|---|---|---|---|---|---|
| value | 0.726 | 0.726 | 275.095 | 1652.329 | 35159.12 | 35194.06 |
The regression coefficients quantify how each explanatory variable affects newborn birth weight while holding the other variables constant.
Gestational duration shows a positive effect on birth weight, meaning that each additional week of pregnancy is associated with an increase in the expected newborn weight.
Anthropometric measurements such as newborn length and cranial diameter also appear to be strong predictors of birth weight, which is consistent with biological expectations.
Maternal behavioural and demographic variables may also contribute to explaining variability in birth weight, although their impact is generally smaller compared with gestational duration and anthropometric measurements.
After removing the number of previous pregnancies (N.gravidanze), the regression model remains strongly driven by gestational and anthropometric factors. Gestational duration, newborn length, head circumference, and newborn sex continue to show highly significant associations with birthweight, and their estimated effects remain stable in magnitude.
From a goodness-of-fit perspective, the explanatory power of the model decreases only marginally (Adjusted R2 ≈ 0.726), and the residual standard error increases slightly. However, the information criteria (AIC and BIC) show a modest deterioration compared to the previous specification, suggesting that removing parity slightly reduces model performance.
Overall, while the model without the number of pregnancies still explains a substantial proportion of the variability in birthweight, the slight worsening of fit indicators suggests that parity contributes additional explanatory information and may be retained in the final specification.
To identify the most appropriate regression specification, a model selection phase was conducted starting from the full multiple linear regression model. The goal is to obtain a parsimonious model that retains only predictors providing meaningful explanatory power, while avoiding unnecessary complexity. In this context, backward stepwise selection is applied, progressively removing predictors and evaluating the impact on overall model quality.
Model comparison is performed using information criteria, namely the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both criteria balance model fit and complexity, but BIC penalizes complexity more strongly and therefore tends to select simpler models. The selected model is subsequently evaluated through adjusted R2, residual standard error (RSE), ANOVA.
To formally assess whether progressively simplified regression specifications lead to a statistically meaningful loss of explanatory power, a nested-model ANOVA (partial F-test) will be conducted across the sequence of candidate models. This procedure compares each reduced model to the preceding (more complex) specification under the assumption of nesting, and tests whether the exclusion of predictors results in a significant increase in the residual sum of squares. In this context, ANOVA provides an inferential complement to information-criterion approaches (AIC/BIC), allowing model simplification decisions to be supported by formal hypothesis testing.
anova_table <- anova(mod1,mod2,mod3,mod4,mod5,mod6)
rownames(anova_table) <- c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6")
knitr::kable(anova_table, caption = "Confronto ANOVA")
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| Model 1 | 2487 | 186743194 | NA | NA | NA | NA |
| Model 2 | 2488 | 186779904 | -1 | -36710.0 | 0.4888948 | 0.4844861 |
| Model 3 | 2490 | 187473818 | -2 | -693913.8 | 4.6206866 | 0.0099307 |
| Model 4 | 2491 | 187574428 | -1 | -100610.4 | 1.3399046 | 0.2471620 |
| Model 5 | 2492 | 188042054 | -1 | -467625.7 | 6.2277237 | 0.0126409 |
| Model 6 | 2493 | 188663107 | -1 | -621053.3 | 8.2710355 | 0.0040625 |
Model 6 was selected as the best-performing specification after comparing all candidate models using information criteria
(AIC/BIC), predictive accuracy metrics (e.g., RMSE), and diagnostic checks on residual behavior. Overall, Model 6 provides the most favorable trade-off between goodness-of-fit and parsimony: it achieves strong predictive performance without introducing unnecessary complexity that could compromise generalizability or interpretability.
The interaction effect between gestational duration and maternal smoking is tested starting from the selected final model (Model 6), in order to preserve the logic of the model selection process. This specification allows us to assess whether the effect of gestational duration on birthweight differs between smoking and non-smoking mothers, without reintroducing predictors that had already been excluded in previous steps.
# Interaction added starting from the selected final model (mod6)
mod6_int <- update(mod6, . ~ . + Gestazione:Fumatrici)
# Comparison between mod6 and the interaction model
interaction_comp <- data.frame(
Model = c("Model 6", "Model 6 + Gestazione:Fumatrici"),
AIC = c(AIC(mod6), AIC(mod6_int)),
BIC = c(BIC(mod6), BIC(mod6_int)),
Adj_R2 = c(summary(mod6)$adj.r.squared,
summary(mod6_int)$adj.r.squared)
)
knitr::kable(
interaction_comp,
digits =3,
caption = "Comparison between Model 6 and Model 6 with interaction"
)
| Model | AIC | BIC | Adj_R2 |
|---|---|---|---|
| Model 6 | 35159.12 | 35194.06 | 0.726 |
| Model 6 + Gestazione:Fumatrici | 35160.12 | 35200.89 | 0.726 |
anova(mod6, mod6_int)
## Analysis of Variance Table
##
## Model 1: Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Model 2: Peso ~ Gestazione + Lunghezza + Cranio + Sesso + Gestazione:Fumatrici
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2493 188663107
## 2 2492 188587860 1 75247 0.9943 0.3188
summary(mod6_int)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## Gestazione:Fumatrici, data = newborn_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.71 -182.57 -17.38 163.25 2623.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6652.2044 135.5964 -49.059 < 2e-16 ***
## Gestazione 31.5608 3.7957 8.315 < 2e-16 ***
## Lunghezza 10.1883 0.3012 33.825 < 2e-16 ***
## Cranio 10.6689 0.4247 25.122 < 2e-16 ***
## SessoM 79.3031 11.2223 7.067 2.05e-12 ***
## Gestazione:Fumatrici -0.7008 0.7028 -0.997 0.319
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2492 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7257
## F-statistic: 1322 on 5 and 2492 DF, p-value: < 2.2e-16
The interaction between gestational duration and maternal smoking was tested starting from Model 6. The comparison between the two models shows that adding the interaction term does not improve the model fit. In particular, AIC and BIC slightly increase and the adjusted R² remains unchanged (0.726).
The ANOVA test confirms that the interaction term is not statistically significant (p = 0.3188). Consistently, the coefficient associated with the interaction in the regression output is not significant. Therefore, there is no evidence that the effect of gestational duration on birth weight differs between smoking and non-smoking mothers.
For this reason, the interaction term is not retained and Model 6 is kept as the final specification.
Before interpreting coefficients from Model 6, it is necessary to asses whether high multicollinearity is present among the predictors.
Multicollinearity arises when two or more regressors are strongly correlated, which may inflate standard errors, destabilize coefficient estimates, and reduce interpretability (even if predictive accuracy remains acceptable). This diagnostic is especially important when the model includes anthropometric variables (e.g., length and cranial diameter), which are biologically related and can be correlated with each other and with gestational age.
In order to quantify multicollinearity the Variance Inflation Factor (VIF) will be used. As a rule of thumb, VIF values above 5 suggest a potential concern, while values above 10 indicate severe multicollinearity that may require corrective actions such as removing or combining predictors, using regularization (e.g., ridge regression), or adopting dimension reduction strategies.
knitr::kable(
vif(mod6),
caption = "Multicollinearity among regressors"
)
| x | |
|---|---|
| Gestazione | 1.654101 |
| Lunghezza | 2.070582 |
| Cranio | 1.606316 |
| Sesso | 1.038918 |
All predictors exhibit VIF values close to 1, with the highest value observed for Length (VIF = 2.07), followed by Gestational Age (VIF = 1.65) and Cranial Diameter (VIF = 1.61).
The variable Sex shows a VIF very close to 1 (VIF = 1.04).
These values are well below the commonly accepted thresholds of 5 (moderate concern) and 10 (serious multicollinearity), indicating that no problematic collinearity exists among the regressors. Although anthropometric variables are biologically related and expected to be correlated, the observed levels of association do not inflate the variance of the coefficient estimates to a concerning degree.
Therefore, the regression coefficients in Model 6 can be considered stable and reliable, and no corrective measures (such as variable removal, transformation, or regularization techniques) are required at this stage.
# R-squared
r2 <- summary(mod6)$r.squared
adj_r2 <- summary(mod6)$adj.r.squared
# RMSE
rmse <- sqrt(mean(residuals(mod6)^2))
# Create summary table
fit_table <- data.frame(
Metric = c("R-squared",
"Adjusted R-squared",
"RMSE (grams)"),
Value = round(c(r2, adj_r2, rmse), 4)
)
kable(fit_table,
caption = "Model 6 – Goodness of Fit Metrics",
align = "c")
| Metric | Value |
|---|---|
| R-squared | 0.7261 |
| Adjusted R-squared | 0.7257 |
| RMSE (grams) | 274.8193 |
Once the final regression model has been selected, it is important to evaluate its predictive quality and verify whether the assumptions of the linear regression model are reasonably satisfied.
In particular we analyse: - the overall goodness of fit of the model through the R² value - the residual distribution - the presence of influential observations that could distort the results
# Summary of the final regression model
summary(mod6)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
## data = newborn_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.1 -184.4 -17.4 163.3 2626.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.6732 135.5952 -49.055 < 2e-16 ***
## Gestazione 31.3262 3.7884 8.269 < 2e-16 ***
## Lunghezza 10.2024 0.3009 33.909 < 2e-16 ***
## Cranio 10.6706 0.4247 25.126 < 2e-16 ***
## SessoM 79.1027 11.2205 7.050 2.31e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1652 on 4 and 2493 DF, p-value: < 2.2e-16
# Diagnostic plots
par(mfrow=c(2,2))
plot(mod6)
The diagnostic plots help evaluate the assumptions of linear regression.
The residuals vs fitted plot is used to assess linearity and homoscedasticity, while the normal Q-Q plot provides a visual indication of whether residuals follow an approximately normal distribution.
Small deviations are common in real datasets and do not necessarily invalidate the predictive usefulness of the model.
After validating the regression model, we can use it to generate predictions for new observations.
For example, we estimate the expected birth weight of a newborn with the following characteristics:
new_case <- data.frame(
Anni.madre = 32,
N.gravidanze = 3,
Fumatrici = 0,
Gestazione = 39,
Lunghezza = 500,
Cranio = 340,
Tipo.parto = "Nat",
Ospedale = "osp1",
Sesso = "F"
)
predicted_weight <- predict(mod6, newdata = new_case)
mean_weight <- mean(newborn_clean$Peso)
prediction_table <- data.frame(
"Predicted weight (g)" = round(as.numeric(predicted_weight), 2),
"Average w (g)" = round(mean_weight, 2)
)
knitr::kable(
prediction_table,
col.names = c("Predicted weight (g)", "Average weight(g)"),
caption = "Prediction result",
align = "c"
)
| Predicted weight (g) | Average weight(g) |
|---|---|
| 3299.27 | 3284.18 |
The table compares the predicted birth weight for the hypothetical newborn with the average birth weight observed in the dataset. A prediction close to the sample mean suggests that the selected characteristics describe a fairly typical birth scenario according to the estimated model.
Visual representations are useful for understanding the relationships between the main predictors and newborn weight.
ggplot(data = newborn_clean) +
geom_point(aes(x = Gestazione,
y = Peso,
col = Sesso),
position = "jitter") +
geom_smooth(aes(x = Gestazione,
y = Peso,
col = Sesso),
se = FALSE,
method = "lm") +
labs(title = "Impact of gestational weeks and sex on newborn weight",
x = "Gestational Weeks",
y = "Birth Weight (grams)",
colour = "Sex") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
The plot highlights a clear positive relationship between gestational duration and newborn birth weight for both sexes. The separation between the two fitted lines also suggests that sex may contribute to differences in birth weight, with one group tending to have slightly higher predicted values across gestational weeks.
ggplot(data = newborn_clean) +
geom_boxplot(aes(x = factor(Fumatrici),
y = Peso)) +
labs(title = "Birth weight by maternal smoking status",
x = "Maternal Smoking",
y = "Birth Weight (grams)") +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_classic()
The boxplot compares birth weight between smoking and non-smoking mothers. Differences between the two groups may indicate that maternal smoking has a negative impact on fetal growth, leading to lower birth weights on average.
The objective of this project was to analyse clinical data from three hospitals in order to identify the main factors influencing newborn birth weight and to build a predictive statistical model.
The exploratory analysis allowed us to understand the structure of the dataset and detect potential anomalies in the data. Several hypothesis tests were performed to evaluate differences between hospitals, verify whether sample averages differ from population values, and assess anthropometric differences between male and female newborns.
A multiple linear regression model was then developed to quantify the relationship between birth weight and several explanatory variables, including maternal characteristics, gestational duration and neonatal measurements. Model selection techniques were applied in order to identify a parsimonious model with good predictive performance.
The results highlight the importance of gestational duration and newborn anthropometric measurements as key predictors of birth weight. These findings are consistent with existing medical knowledge and confirm the usefulness of statistical modelling in supporting neonatal healthcare analysis.