Company: Neonatal Health Solutions
Objective: Develop a statistical model capable of accurately predicting newborn birth weight based on clinical variables collected from three hospitals. The project aims to improve the management of high-risk pregnancies, optimize hospital resource allocation, and ensure better neonatal health outcomes.
This initiative is positioned within a broader context of increasing focus on the prevention of neonatal complications. The ability to predict birth weight represents a crucial opportunity to enhance clinical planning and mitigate risks associated with problematic deliveries, such as premature births or low-birth-weight infants. Below are the key benefits this project will bring to the company and the healthcare sector:
Enhancement of Clinical Forecasting:
Birth weight is a key indicator of neonatal health. Having an accurate
predictive model enables medical staff to intervene promptly in case of
anomalies, thereby reducing perinatal complications such as respiratory
distress or hypoglycemia.
Optimization of Hospital Resources:
Knowing in advance which newborns may require intensive care allows
hospitals to efficiently organize human and technological resources.
This leads to lower operational costs and improved planning for the use
of neonatal intensive care units (NICUs).
Prevention and Identification of Risk Factors:
The model will highlight the factors that most significantly and
negatively affect birth weight (such as maternal smoking, multiple
pregnancies, or advanced maternal age). This information is invaluable
for preventive measures and personalized pregnancy management, enabling
proactive interventions in high-risk cases.
Evaluation of Hospital Practices:
Through comparative analysis across the three participating hospitals,
the company will be able to identify potential differences in clinical
outcomes, such as higher rates of cesarean deliveries in specific
facilities. This allows for quality monitoring and the harmonization of
clinical protocols across hospitals, leading to more consistent
standards of care.
Support for Strategic Planning:
Data analysis and predictive insights can inform decision-making at both
clinical and strategic levels. The company can leverage this information
to implement new public health policies, ensuring a positive impact on
neonatal morbidity and mortality rates.
This section contains all the code and the functions used to conduct the analysis and implement the predictive model
library('ggplot2')
library('gghalves')
library('gridExtra')
library('dplyr')
library('moments')
library('car')
library('MASS')
library('lmtest')
data_set = read.csv("neonati.csv",
stringsAsFactors = T)
# number of data_set rows
n = nrow(data_set)
# quantitative continuos variable analysis
box_plot_violin = function(data_set, variable, color, title, axis_description) {
ggplot(data=data_set,
aes(x = "", y = .data[[variable]]))+
geom_half_boxplot(side = 'l',
fill = color,
alpha = 0.7,) +
geom_half_violin(side = 'r',
fill = color,
alpha = 0.3,) +
labs(
title = title,
x = "",
y = axis_description
) +
theme_minimal()
}
# generate a box plot with a density distribution for the following quantitative continuos variables:
# Anni.madre
# Gestazione
# Cranio
# Lunghezza
# Peso
distr_anni_madre_plot = box_plot_violin(data_set = data_set,
variable = "Anni.madre",
color = "lightsalmon",
title = "Anni.madre",
axis_description = "Age")
distr_gestazione_plot = box_plot_violin(data_set = data_set,
variable = "Gestazione",
color = "yellow",
title = "Gestazione",
axis_description = "weeks")
distr_craio_plot = box_plot_violin(data_set = data_set,
variable = "Cranio",
color = "violet",
title = "Cranio",
axis_description = "mm")
distr_lunghezza_plot = box_plot_violin(data_set = data_set,
variable = "Lunghezza",
color = "lightgreen",
title = "Lunghezza",
axis_description = "mm")
distr_peso_plot = box_plot_violin(data_set = data_set,
variable = "Peso",
color = "lightblue",
title = "Peso",
axis_description = "g")
# generate a bar plot for the following quantitative discrete variables:
# N. Gravidanze
bar_plot_graph = function(data_set, variable, color, x_axis_description, y_axis_description){
ggplot(data = data_set, aes(x = factor(.data[[variable]]))) +
geom_bar(aes(y = after_stat(count / sum(count))),
fill = color, color = color, width = 0.7) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = variable,
x = x_axis_description,
y = y_axis_description
) +
theme_minimal()
}
distr_n_gravidanze_plot = bar_plot_graph(data_set = data_set,
variable = "N.gravidanze",
color = "steelblue",
x_axis_description = "Number of previous pregnancies",
y_axis_description = "Percentage of mothers")
# generate a stacked bar plot for the following qualitative variable
# Fumatrici
# Tipo.parto
# Ospedale
# Sesso
stacked_bar_plot = function(data_set, variable, vcol, vlab, y_axis_description){
ggplot(data = data_set,
aes(x="",
fill = factor(.data[[variable]])))+
geom_bar(position = "fill",
stat = "count",
alpha = 0.8) +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_manual(
values = vcol,
labels = vlab
) +
labs(
x = variable,
y = y_axis_description
) +
theme_minimal()+
theme(
legend.position = "bottom",
legend.title = element_blank()
)
}
fumatrici_plot = stacked_bar_plot(
data_set = data_set,
variable = "Fumatrici",
vcol = c("green", "red"),
vlab = c("No", "Yes"),
y_axis_description = "%"
)
tipo_parto_plot = stacked_bar_plot(
data_set = data_set,
variable = "Tipo.parto",
vcol = c("red", "green"),
vlab = c("Cesarean", "Natural"),
y_axis_description = "%"
)
ospedale_plot = stacked_bar_plot(
data_set = data_set,
variable = "Ospedale",
vcol = c("blue", "yellow","violet"),
vlab = c("osp1", "osp2", "osp3"),
y_axis_description = "%"
)
sesso_plot = stacked_bar_plot(
data_set = data_set,
variable = "Sesso",
vcol = c("lightblue", "salmon"),
vlab = c("M", "F"),
y_axis_description = "%"
)
# this plot highlights the diffrences betweeen the delivery type through the three
# hospital
delivey_comparison_hospital = ggplot(data_set,
aes(x = Ospedale, fill = Tipo.parto)) +
geom_bar(position = "fill", color = "white") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_manual(
values = c("Nat" = "green", "Ces" = "red"),
labels = c("Nat" = "Natural", "Ces" = "Cesarean")
) +
labs(
title = "Proportion of Delivery Type by Hospital",
x = "Hospital",
y = "Percentage",
fill = "Delivery Type"
) +
theme_minimal() +
theme(
legend.position = "bottom"
)
# This dataframe store the population mean of Peso and Lunghezza
# Reference link: https://www.ospedalebambinogesu.it/da-0-a-30-giorni-come-si-presenta-e-come-cresce-80012/
get_pop_peso_lunghezza = function(){
Peso_mu_0 = 3300 #[g]
Lunghezza_mu_0 = 500 #[mm]
return(
data.frame(
Peso_mu_0,
Lunghezza_mu_0
)
)
}
pop_peso_lunghezza= get_pop_peso_lunghezza()
# function to create a box plot by a specified category for a specified variable
# the fuction will be used to create box plot for the following variables
# - Peso
# - Lunghezza
# - Cranio
# Using the category Sesso
box_plot_variable_by_cat = function(data_set, cat, variable, color, title, ylab){
ggplot(data = data_set)+
geom_boxplot(aes(x = .data[[cat]], y= .data[[variable]],
fill = as.factor(.data[[cat]])),
fill = color,
alpha = 0.7)+
labs(
title = title,
x = cat,
y = ylab
)+
theme_minimal()+
theme(
legend.title=element_blank(),
legend.position="bottom"
)
}
# Peso comparison per Sesso
peso_by_sesso_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "Sesso",
variable = "Peso",
color = "lightblue",
title = "Peso per Sesso",
ylab = "Peso [g]"
)
# Peso comparison per Ospedale
peso_by_ospedale_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "Ospedale",
variable = "Peso",
color = "lightblue",
title = "Peso per Ospedale",
ylab = "Peso [g]"
)
# Peso comparison per Tipo.parto
peso_by_tipo_parto_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "Tipo.parto",
variable = "Peso",
color = "lightblue",
title = "Peso per Tipo.parto",
ylab = "Peso [g]"
)
# Lunghezza comparison per Sesso
lunghezza_by_sesso_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "Sesso",
variable = "Lunghezza",
color = "lightgreen",
title = "Lunghezza per Sesso",
ylab = "Lunghezza [mm]"
)
# Lunghezza comparison per Sesso
cranio_by_sesso_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "Sesso",
variable = "Cranio",
color = "violet",
title = "Cranio per Sesso",
ylab = "Cranio [mm]"
)
# function to calculated variability and kutosis indexes of Peso
skew_kurt = function(x){
return(cbind(Fisher = skewness(x), Kutosis = kurtosis(x) - 3))
}
# create a function that plot the correlation matrix
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1.0)
}
correlation_matrix_plot = function(x){
return(pairs(x,
upper.panel = panel.smooth,
lower.panel = panel.cor))
}
# this function create a table to show BIC for the specified model
models_comparison = function(model_1, model_2, perform_anova_test = T) {
# Caluculate ANOVA F-test
if (perform_anova_test == T){
anova_res = anova(model_2, model_1)
cat("ANOVA F-test between the models:\n")
print(anova_res)
cat("\n")
}
# Calculate BIC and AIC values
bic_values = c(BIC(model_1), BIC(model_2))
aic_values = c(AIC(model_1), AIC(model_2))
bic_results <- data.frame(
Model = c("Previous Model", "New Model"),
BIC = bic_values,
AIC = aic_values
)
cat("Models BIC and AIC:\n")
print(bic_results)
invisible(NULL)
}
# plot leverage points of a model
leverage_plot = function(model, number_of_rows){
lev = hatvalues(mod_6)
#plot(lev)
p = sum(lev)
# calculate the threshold
threshold = 2*p/number_of_rows
#abline(h=threshold, col=2)
ggplot(data = NULL,
aes(x = 1:number_of_rows, y = lev)) +
geom_point(shape = 1,
colour = "black") +
geom_hline(yintercept = threshold,
col = "red") +
scale_shape_manual(values = 21:25) +
scale_color_manual(values = c("black", "red")) +
labs(x = "Index",
y = "Leverage") +
labs(
title = "Leverage points")+
theme_minimal()
}
# outliers model analysis
outliers_analysis = function(model){
plot(rstudent(model),
main = "Studentized residuals",
ylab = "rstudent",
xlab = "Index")
# thresholds
abline(h=c(-2,2), col= 2)
# t-test
print(outlierTest(model))
# plots cook's distances
c.dist=cooks.distance(model)
plot(c.dist,
main = "Cook's distance",
ylab = "Cook's D",
xlab = "Index")
print('max cook distance')
print(max(c.dist))
# Residual density
plot(density(residuals(model)),
main = "Residual density",
xlab = "Residuals")
}
# plot model graphical evaluations
model_graphs_plot = function(data_set, var, resp, cat, title){
ggplot(data=data_set)+
geom_point(aes(
x= .data[[var]],
y= .data[[resp]],
col = .data[[cat]]
),
alpha = 0.3,
position = "jitter")+
geom_smooth(
aes(x = Gestazione, y = .data[[resp]], group = .data[[cat]]),
colour = "white",
linewidth = 1.5,
se = FALSE,
method = "lm"
) +
geom_smooth(aes(
x= .data[[var]],
y= .data[[resp]],
col = .data[[cat]]
),
se = F,
method = 'lm')+
labs(
title = title
)+
theme_minimal()+
theme(
legend.position = "bottom",
legend.title = element_blank()
)
}
To build the predictive model, data were collected on 2,500 newborns from three hospitals. The dataset includes the following variables:
Età della madre (Mother’s age): measured in years.
Numero di gravidanze (Number of pregnancies): total number of pregnancies the mother has had.
Fumo materno (Maternal smoking): binary indicator (0 = non-smoker, 1 = smoker)
Durata della gravidanza (Gestational duration): number of weeks of gestation.
Peso del neonato (Birth weight): neonatal weight at birth, measured in grams.
Lunghezza e diametro del cranio (Length and head circumference): neonatal body length and cranial diameter, which can also be measured during pregnancy via ultrasound.
Tipo di parto (Type of delivery): natural (vaginal) or cesarean delivery.
Ospedale di nascita (Birth hospital): Hospital 1, 2, or 3.
Sesso del neonato (Newborn’s sex): male (M) or female (F).
The primary objective is to identify which of these variables are the most predictive of birth weight, with a specific focus on the impact of maternal smoking and gestational duration, as these factors may indicate a higher likelihood of premature birth.
The summary of the statististics of the dataset vaiables show the following information:
summary(data_set)
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00
## 1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00
## Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00
## Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98
## 3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00
## Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00
## Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
## Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1256
## 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1772 osp2:849 M:1244
## Median :3300 Median :500.0 Median :340 osp3:835
## Mean :3284 Mean :494.7 Mean :340
## 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350
## Max. :4930 Max. :565.0 Max. :390
To perform an optimal description the variable analysis will be separated by type as follow.
grid.arrange(distr_anni_madre_plot,
distr_gestazione_plot,
distr_craio_plot,
distr_lunghezza_plot,
distr_peso_plot,
nrow = 2,
ncol = 3)
| Variable | Remarks |
|---|---|
| Anni.madre | Quantitative discrete variable in interval scale treated as continuos. The observed outliers close to zero are likely attributable to data-entry errors and should be verified. Such values may act as high-leverage points, potentially affecting model stability. |
| Gestazione | Quantitative discrete variable in ratio scale treated as continuos. It is evident that most births occur during the normal gestational period, that is, after the 37th week of pregnancy. |
| Cranio | Quantitative continuous variable in ratio scale. The outliers observed in the lower tail of the distribution may indicate preterm births, as smaller head circumference values are often associated with reduced gestational duration. |
| Lunghezza | Quantitative continuous variable in ratio scale. The same considerations of the variable Cranio are valid. |
| Peso | Quantitative continuous variable in ratio scale. The same consideration of the variables Cranio and Lunghezza are valid. In add, a visual inspection of its distribution suggests that Peso approximates a normal (Gaussian) distribution, with a symmetric shape centered around the mean. However, this assumption will need to be formally verified through appropriate normality tests in the subsequent analysis. |
Regarding
distr_n_gravidanze_plot
| Variable | Remarks |
|---|---|
| N.gravidanze | Quantitative discrete variable in ratio scale The average value close to 1 meaning they have had no or only one previous pregnancy. |
grid.arrange(fumatrici_plot,
tipo_parto_plot,
ospedale_plot,
sesso_plot,
nrow = 2,
ncol = 2)
| Variable | Remarks |
|---|---|
| Fumatrici | Qualitative binary variable in nominal scale. 0 (No) = non-smoker, 1 (Yes) = smoker The majority of mothers no smoke during the pregnancy. |
| Tipo.parto | Qualitative variable on nominal scale. The majority of births were natural deliveries, whereas cesarean sections represented a smaller fraction of the sample. |
| Ospedale | Qualitative variable on nominal scale. The data are distributed among three hospitals (osp1, osp2, osp3). This balanced distribution indicates that the sample is well diversified across the participating institutions, reducing potential bias associated with single-hospital data collection. |
| Sesso | Qualitative variable on nominal scale. The data show an almost equal distribution between male and female infants, suggesting a well-balanced sample with respect to gender. This balance sudgest that subsequent analyses are not affected by sex-related sampling bias. |
Following the exploratory descriptive analysis, a data cleaning procedure is carried out to correct data inconsistencies, handle outliers, and recode variables with non-normal or highly skewed distributions following this points:
data_set = data_set %>%
filter(
Anni.madre > 1
)
# update the number of rows
n = nrow(data_set)
data_set$Fumatrici <- as.factor(data_set$Fumatrici)
Create a categorical classification of N.gravidanze. This variable exhibits a markedly skewed and non-normal distribution. Consequently, it is recoded into three categorical classes:
[0, 1): no previous pregnancies.
[1, 2): 1 previous pregnancies.
[2+): 2 or more previous pregnancies.
adding a new column to the dataset: “N.gravidanze_class” :
N.gravidanze_class = cut(data_set$N.gravidanze,
breaks = c(0, 1, 2, Inf),
include.lowest = T,
right = F,
labels = c("[0,1)", "[1,2)", "[2+)"))
data_set['N.gravidanze_class'] = N.gravidanze_class
N.gravidanze_class_plot = bar_plot_graph(data_set = data_set,
variable = "N.gravidanze_class",
color = "steelblue",
x_axis_description = "Number of previous pregnancies",
y_axis_description = "Percentage of mothers")
N.gravidanze_class_plot
The following summarization shows the percentage fraction of delivery type (Ces and Nat) among three hospitals.
round(prop.table(table(data_set$Ospedale, data_set$Tipo.parto), 1) * 100,2)
##
## Ces Nat
## osp1 29.66 70.34
## osp2 29.95 70.05
## osp3 27.82 72.18
delivey_comparison_hospital
The differences among three hospitals are very slight, at around 2%, which suggests that they all follow the same medical standards regarding pregnancy.
To asses this a Chi-square test with a significance level of 5% is performed.
chisq.test(table(data_set$Ospedale, data_set$Tipo.parto))
##
## Pearson's Chi-squared test
##
## data: table(data_set$Ospedale, data_set$Tipo.parto)
## X-squared = 1.083, df = 2, p-value = 0.5819
The test confirms the assumptions above. The \(H_0\) hypotesis of indipendence is not reject\((p-value > 0.05)\) indicating that the proportion of cesarean deliveries does not differ significantly among hospitals.
The following output shows the population mean values of Peso and Lughezza (https://www.ospedalebambinogesu.it/da-0-a-30-giorni-come-si-presenta-e-come-cresce-80012/).
pop_peso_lunghezza
## Peso_mu_0 Lunghezza_mu_0
## 1 3300 500
The following plots provide a detailed visualization of Peso and Lunghezza.
grid.arrange(distr_peso_plot,
distr_lunghezza_plot,
nrow = 1,
ncol = 2)
Given the large sample size (n ≈ 2500), the Central Limit Theorem
allows the assumption that the sampling distributions of Peso
and Lunghezza are approximately normal.
This assumption is further supported by the results of the Shapiro–Wilk
test, which did not indicate significant departures from normality, as
well as by the graphical representations obtained in the descriptive
analysis, where the density plots and box plots show approximately
symmetric and bell-shaped distributions.
shapiro.test(data_set$Peso)
##
## Shapiro-Wilk normality test
##
## data: data_set$Peso
## W = 0.97068, p-value < 2.2e-16
shapiro.test(data_set$Lunghezza)
##
## Shapiro-Wilk normality test
##
## data: data_set$Lunghezza
## W = 0.90944, p-value < 2.2e-16
Infact, the null hypothesis \(H_0\) which assumes a normal distribution, is not rejected \(p-value < 0.05\).
By applying a two-sided t-test to each variable and setting the significance level at 5%, it is possible to assess whether each sample mean differs significantly from the corresponding population value.
t.test(data_set$Peso,
mu = pop_peso_lunghezza$Peso_mu_0,
alternative = "two.sided")
##
## One Sample t-test
##
## data: data_set$Peso
## t = -1.505, df = 2497, p-value = 0.1324
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
## 3263.577 3304.791
## sample estimates:
## mean of x
## 3284.184
The \(H_0\) hypotesis that say that the sample mean of Peso not differ from the Population mean is not reject\((p-value > 0.05)\) indicating that the difference is not statistically significantly .
t.test(data_set$Lunghezza,
mu = pop_peso_lunghezza$Lunghezza_mu_0,
alternative = "two.sided")
##
## One Sample t-test
##
## data: data_set$Lunghezza
## t = -10.069, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
## 493.6628 495.7287
## sample estimates:
## mean of x
## 494.6958
The \(H_0\) hypotesis that say that the sample mean of Lunghezza not differ from the Population mean is reject\((p-value < 0.05)\) indicating that the difference is statistically significantly.
These findings suggest that the sampled neonates are representative of the general population in terms of birth weight, but slightly smaller in terms of body length.
The following plots provide a detailed visualization of the anthropometric variables (Peso, Lunghezza, and Cranio).
grid.arrange(distr_peso_plot,
peso_by_sesso_box_plot,
nrow = 1,
ncol = 2)
grid.arrange(distr_lunghezza_plot,
lunghezza_by_sesso_box_plot,
nrow = 1,
ncol = 2)
grid.arrange(distr_craio_plot,
cranio_by_sesso_box_plot,
nrow = 1,
ncol = 2)
Descriptive analysis shows that male newborns generally exhibit
higher anthropometric measurements than female newborns.
To verify whether these differences are statistically significant, a
pairwise t-test is performed, after validating the assumptions
of normality and homoscedasticity.
As per previouse analysis on Peso and Lunghezza,
given the large sample size (n ≈ 2500), the Central Limit Theorem allows
the assumption that the sampling distributions of Cranio is
approximately normal.
This assumption is further supported by the results of the Shapiro–Wilk
test, which did not indicate significant departures from normality, as
well as by the graphical representations obtained in the descriptive
analysis, where the density plots and boxplots show approximately
symmetric and bell-shaped distributions.
shapiro.test(data_set$Cranio)
##
## Shapiro-Wilk normality test
##
## data: data_set$Cranio
## W = 0.96358, p-value < 2.2e-16
Infact, the null hypothesis \(H_0\) which assumes a normal distribution, is not rejected (p-value <= 0.05).
The assumption of homoscedasticity, that is, the equality of
variances between groups, is verified using Levene’s test.
This procedure assesses whether the variability of the anthropometric
measures differs significantly between sexes, a necessary condition for
the appropriate application of the t-test.
leveneTest(Peso ~ Sesso, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.8222 0.3646
## 2496
leveneTest(Lunghezza ~ Sesso, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 10.571 0.001164 **
## 2496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(Cranio ~ Sesso, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.2098 0.2715
## 2496
The varaibles Peso and Cranio satisfy both the assumptions of normality and homoscedasticity allowing the use of a two sample t-tests with a fixed significance level \(α = 0.05\).
t.test(Peso ~ Sesso,
data = data_set,
var.equal = T,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: Peso by Sesso
## t = -12.111, df = 2496, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.4963 -207.3721
## sample estimates:
## mean in group F mean in group M
## 3161.061 3408.496
t.test(Cranio ~ Sesso,
data = data_set,
var.equal = T,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: Cranio by Sesso
## t = -7.4344, df = 2496, p-value = 1.436e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -6.110877 -3.560044
## sample estimates:
## mean in group F mean in group M
## 337.6231 342.4586
For both variables the p-value is very small, very close to 0, therefore \(H_0\) hypothesis is rejected and it is concluded that the mean values differ significantly between male and female newborns for the variables under analysis.
Regarding the variable Lunghezza, it satisfy the normality assumption but not that of homoscedasticity. Consequently, it is not appropriate to apply the classic two sample t-test but a variant that relax the homoscedasticity assumption, the Welch’s t-test.
t.test(Lunghezza ~ Sesso,
data = data_set,
var.equal = F,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -11.939001 -7.882672
## sample estimates:
## mean in group F mean in group M
## 489.7641 499.6750
The p-value is very small, very close to 0, therefore \(H_0\) hypothesis is rejected and it is concluded that the mean values differ significantly between male and female newborns for the variables under analysis.
The output variable Peso, as verified in the previous section, approximately follow a normal distribution. In addition, the skewness and kurtosis indices are calculated:
skew_kurt(data_set$Peso)
## Fisher Kutosis
## [1,] -0.6474036 2.028753
The distribution appears slightly right-skewedand leptokurtic, as confirmed by the skewness and kurtosis coefficients.
This indicates that it has heavier tails and sharper peak than the normal distribution; however, it can still be reasonably approximated by a normal distribution.
The correlation of the input variables with Peso is analyzed.
correlation_matrix_plot(data_set)
From the analysis of the quantitative variables, it emerges that the birth weight (Peso) of a newborn is strongly correlated with:
Length (Lunghezza), with a correlation coefficient of 0.80.
Head circumference (Cranio), with a correlation coefficient of 0.70;
as shows in the scatter plots above.
Additionally, these two variables have a weak correlation 0.60 and should be not a problem for the linear moldel.
The qualitative variables are treated differently, as it is difficult to extract meaningful information from them using a correlation matrix. It is only possible to observe that the data points are distributed across the various classes, without providing any clear quantitative relationship.
In the section “2.3 Anthropometric measurements differ significantly
between the two sexes” has been demonstrated that
Sesso exerts a statistically
significant influence on Peso, leading to an
expectedly relevant regression coefficient.
The subsequent step consists of testing the other qualitative predictors
in the same way, to determine whether they also contribute significantly
to the explanation of the newborn’s weight.
The following plots provide a detailed visualization of Ospedale:
grid.arrange(
distr_peso_plot,
peso_by_ospedale_box_plot,
nrow = 1,
ncol = 2
)
The distribution of Peso across the three hospitals appears
very similar, suggesting that there are no statistically significant
deviations in newborn weight among the hospitals.
The assumptions above for mean comparison are verified using a pairwise
t-test, assessing both the normality and homoscedasticity
hypotheses.
As the normality assumption for Peso was already confirmed in
the previous section, it is only necessary to verify the
homoscedasticity assumption by performing the Levene’s tes t.
leveneTest(Peso ~ Ospedale, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.5912 0.5537
## 2495
The Levene’s test indicate that the null hypotesys of equal variance
\(H_0\) cannot be rejected \((p-value > 0.05)\).
Therefore, it can be assumed that the variances of Peso are
homogeneous across the three hospitals, satisfying the homoscedasticity
assumption required for the t-test.
The pairwise t-test concludes:
pairwise.t.test(data_set$Peso,
data_set$Ospedale,
p.adjust.method = 'bonferroni',
paired = F,
pool.sd = T)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data_set$Peso and data_set$Ospedale
##
## osp1 osp2
## osp2 1.00 -
## osp3 0.33 0.32
##
## P value adjustment method: bonferroni
The pairwise t-test, adjusted using the Bonferroni
correction, shows that no pairwise comparison among the three hospitals
is statistically significant \((p-value >
0.05)\).
Therefore, it can be concluded that there are no significant differences
in the mean birth weight of newborns across the hospitals, confirming
that the Peso variable behaves consistently among the three
medical centers.
Consequently the corresponding regression coefficient \(\beta\) is not expected to be significant.
The Bonferroni correction is a conservative adjustment method used to
control the type-I error when multiple hypothesis tests are performed
simultaneously.
Specifically, it divides the overall significance level (α) by the
number of comparisons (m) to obtain a more stringent threshold
for significance.
After evaluating the influence of the Ospedale variable on Peso, the analysis proceeds with the assessment of the remaining qualitative predictors: Tipo.parto, Fumatrici and N.gravidanze_class.
Both Tipo.parto and Fumatrici are binary and potentially relevant for explaining variations in newborn birth weight.
The same consideration applies to N.gravidaze_class whose grouping may help highlight potential relationships with Peso.
The objective of this analysis is to determine whether the delivery type (Tipo.parto) and maternal smoking status (Fumatrici) have a statistically significant impact on newborn weight.
# Peso comparison per N.gravidanze_class
peso_by_N.gravidanze_class_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "N.gravidanze_class",
variable = "Peso",
color = "lightblue",
title = "Peso per N.gravidanze_class",
ylab = "Peso [g]"
)
# Peso comparison per Fumatrici
peso_by_fumatrici_box_plot = box_plot_variable_by_cat(
data_set = data_set,
cat = "Fumatrici",
variable = "Peso",
color = "lightblue",
title = "Peso per Fumatrici",
ylab = "Peso [g]"
)
grid.arrange(
distr_peso_plot,
peso_by_tipo_parto_box_plot,
nrow = 1,
ncol = 2
)
grid.arrange(
distr_peso_plot,
peso_by_fumatrici_box_plot,
nrow = 1,
ncol = 2
)
grid.arrange(
distr_peso_plot,
peso_by_N.gravidanze_class_box_plot,
nrow = 1,
ncol = 2
)
From the graphical analysis, it appears that neither the type of delivery (Tipo.parto) nor maternal smoking (Fumatrici) nor N.gravidanze_class have a significant effect on the weight of the newborn (Peso).
The assumption above needs to be statistically verified using a two sample t-test validating before the homoscedasticity assumptions:
leveneTest(Peso ~ Tipo.parto, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 4.5678 0.03267 *
## 2496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(Peso ~ Fumatrici, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.4267 0.2324
## 2496
leveneTest(Peso ~ N.gravidanze_class, data = data_set)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.3904 0.2492
## 2495
The Levene’s test indicates that the homoscedasticity assumption holds for Fumatrici and N.gravidanze_class but not for Tipo.parto.
This means that the variance of Peso is homogeneous between smoking groups but significantly different between the two delivery types.
In this case the analysis will proceed applying the standard two sample t- test for Fumatrici and the Welch’s t-test for Tipo.parto.
t.test(Peso ~ Fumatrici,
data = data_set,
var.equal = T,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: Peso by Fumatrici
## t = 0.94878, df = 2496, p-value = 0.3428
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -53.24919 153.08153
## sample estimates:
## mean in group 0 mean in group 1
## 3286.262 3236.346
t.test(Peso ~ Tipo.parto,
data = data_set,
var.equal = F,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Peso by Tipo.parto
## t = -0.13626, df = 1494.4, p-value = 0.8916
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
## -46.44246 40.40931
## sample estimates:
## mean in group Ces mean in group Nat
## 3282.047 3285.063
For the variable N.gravidanze_class a paired t-test, with Bonferroni correction, is applied to evaluate whether there are statistically significant differences in weight between the different classes of number of pregnancies.
pairwise.t.test(data_set$Peso,
data_set$N.gravidanze_class,
p.adjust.method = 'bonferroni',
paired = F,
pool.sd = T)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data_set$Peso and data_set$N.gravidanze_class
##
## [0,1) [1,2)
## [1,2) 1.00 -
## [2+) 0.93 0.79
##
## P value adjustment method: bonferroni
The test results indicate that the null hypothesis \(H_0\) cannot be rejected, meaning that there are no statistically significant differences in the birth weight of newborns based on either maternal smoking status, delivery type (caesarean versus natural) or Number of previous pregnancies.
Consequently the corresponding regression coefficients \(\beta\) are not expected to be significant.
Therefore, the grouped version of the variable is not retained in the final model, and the original numeric form of N.gravidanze is used instead.
data_set = subset(data_set, select = -N.gravidanze_class)
At this stage a multiple regression model is constructed, initially incorporating all predictor variable to asses their joint influence on Peso.
mod_1 = lm(data_set$Peso~ ., data=data_set)
summary(mod_1)
##
## Call:
## lm(formula = data_set$Peso ~ ., data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1123.26 -181.53 -14.45 161.05 2611.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6735.7960 141.4790 -47.610 < 2e-16 ***
## Anni.madre 0.8018 1.1467 0.699 0.4845
## N.gravidanze 11.3812 4.6686 2.438 0.0148 *
## Fumatrici1 -30.2741 27.5492 -1.099 0.2719
## Gestazione 32.5773 3.8208 8.526 < 2e-16 ***
## Lunghezza 10.2922 0.3009 34.207 < 2e-16 ***
## Cranio 10.4722 0.4263 24.567 < 2e-16 ***
## Tipo.partoNat 29.6335 12.0905 2.451 0.0143 *
## Ospedaleosp2 -11.0912 13.4471 -0.825 0.4096
## Ospedaleosp3 28.2495 13.5054 2.092 0.0366 *
## SessoM 77.5723 11.1865 6.934 5.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 668.7 on 10 and 2487 DF, p-value: < 2.2e-16
The full regression model explains about the 73% of the variance in Peso (\(R^2=0.7278\)) indicating a strong overall fit.
The F-statistic is highly significant (p < 0.05), confirming the global validity of the model.
The most significatve predictors are Gestazione, Lunghezza, and Cranio, all positively associated with birth weight. Sesso, Tipo.parto and Ospedale are also significant, but with smaller contributions.
Anni.madre and Fumatrici show no significant effect, suggesting that they may be excluded in subsequent model refinements.
The next phase involves the iterative removal of non-significant predictors, starting with Anni.madre.
For each variable removed, an ANOVA F-test will be conducted to assess whether its exclusion leads to a statistically significant loss of explanatory power.
The process will continue recursively until the most parsimonious model with the best explanatory performance is achieved.
mod_2 = update(mod_1, ~. -Anni.madre)
summary(mod_2)
##
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1113.82 -180.30 -16.22 160.66 2616.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6708.6189 136.0211 -49.320 < 2e-16 ***
## N.gravidanze 12.5833 4.3400 2.899 0.00377 **
## Fumatrici1 -30.4268 27.5455 -1.105 0.26944
## Gestazione 32.2996 3.7997 8.501 < 2e-16 ***
## Lunghezza 10.2916 0.3008 34.209 < 2e-16 ***
## Cranio 10.4874 0.4257 24.638 < 2e-16 ***
## Tipo.partoNat 29.6654 12.0892 2.454 0.01420 *
## Ospedaleosp2 -10.9509 13.4442 -0.815 0.41541
## Ospedaleosp3 28.5171 13.4986 2.113 0.03474 *
## SessoM 77.6452 11.1849 6.942 4.91e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2488 degrees of freedom
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7279
## F-statistic: 743.1 on 9 and 2488 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_1, model_2 = mod_2)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
## Model 2: data_set$Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2488 186779904
## 2 2487 186743194 1 36710 0.4889 0.4845
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35215.45 35145.57
## 2 New Model 35208.12 35144.06
The ANOVA test demostreates that Anni.madre not reduce the power of the model to predict the newborn weight.
This new model mod_2 maintains the same capacity to explain about the 73% of the variance in Peso (\(R^2=0.7278\)) as mod_1.
The F-statistic is highly significant (p < 0.05), confirming the global validity of the model, and the BIC and AIC metrics are smaller than those of the previous model. It follows that the new model mod_2 is better than the previous one.
Additionally the variable N.gravidanze increase its significance level respect to mod_1.
mod_3 = update(mod_2, ~. -Fumatrici)
summary(mod_3)
##
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1113.07 -181.71 -16.66 161.08 2619.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6707.9252 136.0257 -49.314 < 2e-16 ***
## N.gravidanze 12.3360 4.3344 2.846 0.00446 **
## Gestazione 32.0386 3.7925 8.448 < 2e-16 ***
## Lunghezza 10.3059 0.3006 34.286 < 2e-16 ***
## Cranio 10.4920 0.4257 24.648 < 2e-16 ***
## Tipo.partoNat 29.4080 12.0875 2.433 0.01505 *
## Ospedaleosp2 -10.8939 13.4447 -0.810 0.41786
## Ospedaleosp3 28.7917 13.4969 2.133 0.03301 *
## SessoM 77.4657 11.1842 6.926 5.48e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2489 degrees of freedom
## Multiple R-squared: 0.7287, Adjusted R-squared: 0.7278
## F-statistic: 835.7 on 8 and 2489 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_2, model_2 = mod_3)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2489 186871503
## 2 2488 186779904 1 91599 1.2201 0.2694
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35208.12 35144.06
## 2 New Model 35201.52 35143.29
The ANOVA test demostreates that Fumatrici not reduce the power of the model to predict the newborn weight and the BIC and AIC metrics are smaller than those of the previous model. It follows that the new model mod_3 is better than the previous one.
This new model, mod_3, maintains the same capacity to explain about the 73% of the variance in Peso (\(R^2=0.7278\)) as mod_2.
mod_4 = update(mod_3, ~. -Ospedale)
summary(mod_4)
##
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1129.14 -181.97 -16.26 160.95 2638.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6708.0171 136.0715 -49.298 < 2e-16 ***
## N.gravidanze 12.7356 4.3385 2.935 0.00336 **
## Gestazione 32.3253 3.7969 8.514 < 2e-16 ***
## Lunghezza 10.2833 0.3009 34.177 < 2e-16 ***
## Cranio 10.5063 0.4263 24.648 < 2e-16 ***
## Tipo.partoNat 30.1601 12.1027 2.492 0.01277 *
## SessoM 77.9171 11.1994 6.957 4.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.4 on 2491 degrees of freedom
## Multiple R-squared: 0.7277, Adjusted R-squared: 0.727
## F-statistic: 1109 on 6 and 2491 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_3, model_2 = mod_4)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2491 187574428
## 2 2489 186871503 2 702925 4.6812 0.009349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35201.52 35143.29
## 2 New Model 35195.25 35148.67
Although Ospedale appeared to be slightly statistically significant in the model mod_3, it was removed because the previous analyses indicated that this variable does not have a meaningful influence on Peso.
The same reasoning also applies to Tipo.parto and N.gravidanze, which was therefore excluded from the model.
mod_5 = update(mod_4, ~. -Tipo.parto)
summary(mod_5)
##
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza +
## Cranio + Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.37 -180.98 -15.57 163.69 2639.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.7251 135.8036 -49.201 < 2e-16 ***
## N.gravidanze 12.4554 4.3416 2.869 0.00415 **
## Gestazione 32.3827 3.8008 8.520 < 2e-16 ***
## Lunghezza 10.2455 0.3008 34.059 < 2e-16 ***
## Cranio 10.5410 0.4265 24.717 < 2e-16 ***
## SessoM 77.9807 11.2111 6.956 4.47e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7265
## F-statistic: 1327 on 5 and 2492 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_4, model_2 = mod_5)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2492 188042054
## 2 2491 187574428 1 467626 6.2101 0.01277 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35195.25 35148.67
## 2 New Model 35193.65 35152.89
mod_6 = update(mod_5, ~. -N.gravidanze)
summary(mod_6)
##
## Call:
## lm(formula = data_set$Peso ~ Gestazione + Lunghezza + Cranio +
## Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.1 -184.4 -17.4 163.3 2626.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.6732 135.5952 -49.055 < 2e-16 ***
## Gestazione 31.3262 3.7884 8.269 < 2e-16 ***
## Lunghezza 10.2024 0.3009 33.909 < 2e-16 ***
## Cranio 10.6706 0.4247 25.126 < 2e-16 ***
## SessoM 79.1027 11.2205 7.050 2.31e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1652 on 4 and 2493 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_5, model_2 = mod_6)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Model 2: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2493 188663107
## 2 2492 188042054 1 621053 8.2304 0.004154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35193.65 35152.89
## 2 New Model 35194.06 35159.12
The model mod_6 appears to be the most appropriate
specification.
Despite using a smaller number of predictors, it maintains nearly the same level of explained variance as the previous model about 73%, while showing an improved F-statistic with a p-value < 0.05 and a BIC value almost identical to that of the previous model mod_5.
This indicates that the reduced model achieves a more efficient balance between parsimony and explanatory power.
As previously explained the variable Gestazione increases, demostrating an increasing asymptotic effect that could be explained by including the logarithmic contribution of Gestazione in the model.
mod_7 = update(mod_6, ~. +I(log(Gestazione)))
summary(mod_7)
##
## Call:
## lm(formula = data_set$Peso ~ Gestazione + Lunghezza + Cranio +
## Sesso + I(log(Gestazione)), data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1141.78 -181.60 -14.94 162.19 2652.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4651.2180 4258.5700 1.092 0.274850
## Gestazione 148.5229 44.2956 3.353 0.000811 ***
## Lunghezza 10.3199 0.3038 33.975 < 2e-16 ***
## Cranio 10.7828 0.4263 25.296 < 2e-16 ***
## SessoM 76.6305 11.2455 6.814 1.18e-11 ***
## I(log(Gestazione)) -4360.2156 1641.9597 -2.655 0.007970 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.8 on 2492 degrees of freedom
## Multiple R-squared: 0.7269, Adjusted R-squared: 0.7263
## F-statistic: 1326 on 5 and 2492 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_6, model_2 = mod_7)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso + I(log(Gestazione))
## Model 2: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2492 188130750
## 2 2493 188663107 -1 -532357 7.0517 0.00797 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35194.06 35159.12
## 2 New Model 35194.83 35154.06
This contribution does not improve the model. The logarithmic contribution is less significant than the linear one; therefore, the mod_6 model is still considered the best.
To must to be sure to cover all the possibility a new model called mod_8 is created using the automatic step wise procedure that minimizes the BIC.
This criterion is preferred because it applies a stricter penalty to models with a greater number of parameters, ensuring simplicity without compromising explanatory accuracy.
mod_8 = stepAIC(mod_1, direction = 'both', k=log(n))
## Start: AIC=28118.61
## data_set$Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Anni.madre 1 36710 186779904 28111
## - Fumatrici 1 90677 186833870 28112
## - Ospedale 2 687555 187430749 28112
## - N.gravidanze 1 446244 187189438 28117
## - Tipo.parto 1 451073 187194266 28117
## <none> 186743194 28119
## - Sesso 1 3610705 190353899 28159
## - Gestazione 1 5458852 192202046 28183
## - Cranio 1 45318506 232061700 28654
## - Lunghezza 1 87861708 274604902 29074
##
## Step: AIC=28111.28
## data_set$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Fumatrici 1 91599 186871503 28105
## - Ospedale 2 693914 187473818 28105
## - Tipo.parto 1 452049 187231953 28110
## <none> 186779904 28111
## - N.gravidanze 1 631082 187410986 28112
## + Anni.madre 1 36710 186743194 28119
## - Sesso 1 3617809 190397713 28151
## - Gestazione 1 5424800 192204704 28175
## - Cranio 1 45569477 232349381 28649
## - Lunghezza 1 87852027 274631931 29066
##
## Step: AIC=28104.68
## data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Ospedale 2 702925 187574428 28098
## - Tipo.parto 1 444404 187315907 28103
## <none> 186871503 28105
## - N.gravidanze 1 608136 187479640 28105
## + Fumatrici 1 91599 186779904 28111
## + Anni.madre 1 37633 186833870 28112
## - Sesso 1 3601860 190473363 28145
## - Gestazione 1 5358199 192229702 28168
## - Cranio 1 45613331 232484834 28642
## - Lunghezza 1 88259386 275130889 29063
##
## Step: AIC=28098.41
## data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Sesso
##
## Df Sum of Sq RSS AIC
## - Tipo.parto 1 467626 188042054 28097
## <none> 187574428 28098
## - N.gravidanze 1 648873 188223301 28099
## + Ospedale 2 702925 186871503 28105
## + Fumatrici 1 100610 187473818 28105
## + Anni.madre 1 44184 187530244 28106
## - Sesso 1 3644818 191219246 28139
## - Gestazione 1 5457887 193032315 28162
## - Cranio 1 45747094 233321522 28636
## - Lunghezza 1 87955701 275530129 29051
##
## Step: AIC=28096.81
## data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso
##
## Df Sum of Sq RSS AIC
## <none> 188042054 28097
## - N.gravidanze 1 621053 188663107 28097
## + Tipo.parto 1 467626 187574428 28098
## + Ospedale 2 726146 187315907 28103
## + Fumatrici 1 92548 187949505 28103
## + Anni.madre 1 45366 187996688 28104
## - Sesso 1 3650790 191692844 28137
## - Gestazione 1 5477493 193519547 28161
## - Cranio 1 46098547 234140601 28637
## - Lunghezza 1 87532691 275574744 29044
summary(mod_8)
##
## Call:
## lm(formula = data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza +
## Cranio + Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.37 -180.98 -15.57 163.69 2639.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.7251 135.8036 -49.201 < 2e-16 ***
## N.gravidanze 12.4554 4.3416 2.869 0.00415 **
## Gestazione 32.3827 3.8008 8.520 < 2e-16 ***
## Lunghezza 10.2455 0.3008 34.059 < 2e-16 ***
## Cranio 10.5410 0.4265 24.717 < 2e-16 ***
## SessoM 77.9807 11.2111 6.956 4.47e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7265
## F-statistic: 1327 on 5 and 2492 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_6, model_2 = mod_8)
## ANOVA F-test between the models:
## Analysis of Variance Table
##
## Model 1: data_set$Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso
## Model 2: data_set$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2492 188042054
## 2 2493 188663107 -1 -621053 8.2304 0.004154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35194.06 35159.12
## 2 New Model 35193.65 35152.89
The automatic stepwise procedure returns a model that includes the same predictors as mod_5, which was previously discarded due to its very low correlation with Peso. Therefore, these results confirm that mod_6 represents the most appropriate and statistically robust model specification.
summary(mod_6)
##
## Call:
## lm(formula = data_set$Peso ~ Gestazione + Lunghezza + Cranio +
## Sesso, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.1 -184.4 -17.4 163.3 2626.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.6732 135.5952 -49.055 < 2e-16 ***
## Gestazione 31.3262 3.7884 8.269 < 2e-16 ***
## Lunghezza 10.2024 0.3009 33.909 < 2e-16 ***
## Cranio 10.6706 0.4247 25.126 < 2e-16 ***
## SessoM 79.1027 11.2205 7.050 2.31e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1652 on 4 and 2493 DF, p-value: < 2.2e-16
For mod_6, a multicollinearity analysis is first performed among the regressors.
vif(mod_6)
## Gestazione Lunghezza Cranio Sesso
## 1.654101 2.070582 1.606316 1.038918
The results does not show evidence of multicollinearity among the predictors, all VIF values are below the critical threshold of 5.
The next step is to analyse the outliers and residuals:
par(mfrow = c(2, 2))
plot(mod_6)
The following table provide a summary on the diagnostic plots:
| Plot name | Remarks |
|---|---|
| Residual vs Fitted | The residuals seem more or less distributed around 0, which suggests that there are no problems related to non-linearity. |
| Normal Q-Q | The points are mostly lined up along the diagonal line but not in the tails, which suggests that the normality assumption may be isn’t met. |
| Scale-Location | The dispersion of the residuals does not appear to increase systematically with the fitted values, but the pattern seems slightly shifted. This suggests that the homoscedasticity assumption may not be fully satisfied. |
| Residuals vs Leverage | The presence of a few observations with high leverage suggests the need for further investigation into their potential impact on model stability and coefficient estimates. |
To provide a strict statistical validation on the descriptions above the following assumption are validated:
residuals normality assumption
residuals homoscedasticity assumption
residuals no-correlation assumptions
shapiro.test(residuals(mod_6))
##
## Shapiro-Wilk normality test
##
## data: residuals(mod_6)
## W = 0.97426, p-value < 2.2e-16
bptest(mod_6)
##
## studentized Breusch-Pagan test
##
## data: mod_6
## BP = 89.198, df = 4, p-value < 2.2e-16
dwtest(mod_6)
##
## Durbin-Watson test
##
## data: mod_6
## DW = 1.9553, p-value = 0.1318
## alternative hypothesis: true autocorrelation is greater than 0
The unique assumption validated is the residulas no-correlation. At this stage it is important focusing the analysis on leverage points and outliers.
leverage_plot(model = mod_6, number_of_rows = n)
outliers_analysis(mod_6)
## rstudent unadjusted p-value Bonferroni p
## 1549 9.980701 4.9791e-23 1.2438e-19
## 155 4.948947 7.9596e-07 1.9883e-03
## 1305 4.779033 1.8638e-06 4.6558e-03
## [1] "max cook distance"
## [1] 0.9780498
The outlier test highlights a few observations that may
disproportionately affect the model.
Their exclusion allows evaluating whether the resulting model provides a
better representation of the data.
outlier_index = c(1549, 155, 1305)
data_set_no_outliers <- data_set[-outlier_index, ]
final_model = lm(Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
data = data_set_no_outliers)
summary(final_model)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
## data = data_set_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1155.47 -180.59 -14.36 163.23 1131.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6662.1067 131.8102 -50.543 < 2e-16 ***
## Gestazione 28.0935 3.6904 7.613 3.79e-14 ***
## Lunghezza 10.9552 0.2997 36.549 < 2e-16 ***
## Cranio 9.9707 0.4171 23.903 < 2e-16 ***
## SessoM 78.7775 10.9057 7.224 6.70e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.2 on 2490 degrees of freedom
## Multiple R-squared: 0.7405, Adjusted R-squared: 0.7401
## F-statistic: 1776 on 4 and 2490 DF, p-value: < 2.2e-16
models_comparison(model_1 = mod_6, model_2 = final_model, perform_anova_test = F)
## Models BIC and AIC:
## Model BIC AIC
## 1 Previous Model 35194.06 35159.12
## 2 New Model 35006.04 34971.11
The regression model re-estimated after excluding the influential
observations demonstrates improved performance compared to the initial
specification.
All diagnostic indicators including the adjusted \(R^2\), F-statistic, and residual standard
error show favorable changes, confirming that the removal of outliers
enhances the model’s explanatory power and overall reliability.
par(mfrow = c(2, 2))
plot(final_model)
and Executing the residuals and outliers test:
shapiro.test(residuals(final_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(final_model)
## W = 0.99189, p-value = 1.31e-10
bptest(final_model)
##
## studentized Breusch-Pagan test
##
## data: final_model
## BP = 7.0521, df = 4, p-value = 0.1332
dwtest(final_model)
##
## Durbin-Watson test
##
## data: final_model
## DW = 1.9553, p-value = 0.132
## alternative hypothesis: true autocorrelation is greater than 0
outliers_analysis(final_model)
## rstudent unadjusted p-value Bonferroni p
## 1397 -4.345182 1.4474e-05 0.036113
## [1] "max cook distance"
## [1] 0.07703717
The only assumption that is not fully validated concerns the
normality of the residuals.
However, the residual density function shows a distribution reasonably
close to normality, suggesting that this deviation does not
substantially affect the model’s validity.
Moreover, Cook’s distances have significantly decreased in the refined
model, with a maximum value of 0.077, well below the conventional
threshold of concern. Therefore, the identified outliers can be
reasonably disregarded, and the current model, final_model can
be retained as the best-fitting and most robust version.
Once the model is validated, we will use it to make practical predictions. For example, we can estimate the weight of a female newborn considering a mother in her third pregnancy who will give birth at 39 weeks.
The model includes the variables Cranio and
Lunghezza, which represent the main anthropometric predictors
of birth weight but not the number of previous pregnancies
N.gravidanze.
For the predictive estimation, the average values of these two variables
are used, as they correspond to typical neonatal measurements within the
analyzed sample.
case_1 = data.frame(
Gestazione = 39,
Lunghezza = mean(data_set_no_outliers$Lunghezza),
Cranio = mean(data_set_no_outliers$Cranio),
Sesso = "F")
predict(final_model,
newdata = case_1,
interval = "confidence")
## fit lwr upr
## 1 3244.307 3229.361 3259.254
Using the validated regression model (final_model), the
predicted birth weight for a female newborn at the 39th gestational
week, with average anthropometric characteristics (mean skull diameter
and body length), is 3244.3 grams.
The 95% confidence interval ranges from 3229.4 g to 3259.3
g, indicating a narrow uncertainty band and suggesting that the
model provides a highly stable prediction around the expected mean
value.
This section communicate the model’s results and the most significant relationships between variables using the following visual representations.
model_graphs_plot(data_set = data_set_no_outliers,
var = "Gestazione",
resp = "Peso",
cat = "Sesso",
title = "Peso per Gestazione by Sesso")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The graph clearly shows a positive, nearly linear relationship between gestation (Gestazione) and weight (Peso), broken down by gender (female and male).
The regression lines (one for each group) are almost parallel, suggesting that:
The trend of weight growth with gestational length is consistent for both genders.
Males tend to be slightly heavier than females at the same gestational age, which is consistent with the findings of the statistical tests and the linear model.
model_graphs_plot(data_set = data_set_no_outliers,
var = "Gestazione",
resp = "Peso",
cat = "Fumatrici",
title = "Peso per Gestazione by Fumatrici")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The graph shows that the relationship between gestational age (Gestazione) and birth weight is positive and almost linear for both categories of mother, regardless of whether they smoke (0 = no-smoker, 1 = smoker). The two lines almost overlap, suggesting that smoking has no statistically significant effect on birth weight.
However, the greater proportion of smokers compared to non-smokers in the sample reduces the statistical power of the test and increases the uncertainty in the estimate. A more reliable assessment would require a more balanced sample.