library(ggpubr)
## Caricamento del pacchetto richiesto: ggplot2
library(moments)
library(car)
## Caricamento del pacchetto richiesto: carData
library(lmtest)
## Caricamento del pacchetto richiesto: zoo
##
## Caricamento pacchetto: 'zoo'
## I seguenti oggetti sono mascherati da 'package:base':
##
## as.Date, as.Date.numeric
This project aims to build a statistical model to predict newborn birth weight, based on different clinical and demographic variables. Having this information at the time of birth represent a great opportunity to improve clinical planning and to reduce risks associated with preterm birth or low birth weight infants.
#Dataset importing
data = read.csv ("neonati.csv",stringsAsFactors=TRUE)
#Visualizing variables type and summarizing statistics values
str(data)
## 'data.frame': 2500 obs. of 10 variables:
## $ Anni.madre : int 26 21 34 28 20 32 26 25 22 23 ...
## $ N.gravidanze: int 0 2 3 1 0 0 1 0 1 0 ...
## $ Fumatrici : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gestazione : int 42 39 38 41 38 40 39 40 40 41 ...
## $ Peso : int 3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
## $ Lunghezza : int 490 490 500 515 480 495 480 510 500 510 ...
## $ Cranio : int 325 345 375 365 335 340 345 349 335 362 ...
## $ Tipo.parto : Factor w/ 2 levels "Ces","Nat": 2 2 2 2 2 2 2 2 1 1 ...
## $ Ospedale : Factor w/ 3 levels "osp1","osp2",..: 3 1 2 2 3 2 3 1 2 2 ...
## $ Sesso : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...
summary(data)
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00
## 1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00
## Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00
## Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98
## 3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00
## Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00
## Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
## Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1256
## 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1772 osp2:849 M:1244
## Median :3300 Median :500.0 Median :340 osp3:835
## Mean :3284 Mean :494.7 Mean :340
## 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350
## Max. :4930 Max. :565.0 Max. :390
We start by importing the dataset, checking if all the variables have the expected type and visualize a brief summary of the data. During this step we also converted string variables to factors. From the summary() output, we observe that variables such as Ospedale, Sesso, and Tipo_Parto are correctly recognized as factors, each with a limited number of distinct levels. Now we explore the dataset using descriptive statistics.
# We start by creating a vector containing only numeric variables in data
numeric_var = sapply(data, is.numeric)
summary(data[,numeric_var])
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00
## 1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00
## Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00
## Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98
## 3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00
## Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00
## Peso Lunghezza Cranio
## Min. : 830 Min. :310.0 Min. :235
## 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330
## Median :3300 Median :500.0 Median :340
## Mean :3284 Mean :494.7 Mean :340
## 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350
## Max. :4930 Max. :565.0 Max. :390
# We analyze mean and standard deviation for each numeric variable
sapply(data[,numeric_var],mean)
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza
## 28.1640 0.9812 0.0416 38.9804 3284.0808 494.6920
## Cranio
## 340.0292
sapply(data[,numeric_var],sd)
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza
## 5.2735783 1.2805868 0.1997133 1.8686392 525.0387442 26.3186437
## Cranio
## 16.4253299
#Along with numeric values we will explore variancy with boxplots
categorical_variables = c("Sesso","Tipo.parto","Ospedale")
for (var in categorical_variables){
formula = reformulate(var,response ="Peso")
#Create boxplot
boxplot(formula,
data=data,
col="lightblue4",
main=paste("Birth weight by",var),
xlab=var,
ylab="Birth weight [grams]")
}
From this first look the dataset seems balanced, we did not found any clear anomolies yet or extreme variance. Before moving on with hypothesis testing we will give a rapid look at the connection between Peso and the two variables that are suggested as influent, Fumatrici and Gestazione.
boxplot(Peso ~ Fumatrici, data = data,
main = "Birth Weight by Maternal Smoking",
col = "red4",
xlab = "Smoking (0 = No, 1 = Yes)",
ylab = "Birth Weight [grams]")
#Pearson correlation coefficient
cor(data$Peso,data$Gestazione)
## [1] 0.5917687
As highlighted in the project text we explore the influence of maternal smoking and gestational duration on birth weight. From the boxplot we can observe that infants born to smoking mother tends to have a slight lower median weight and less variability. To confirm that the variable “Fumo” might help in predcting low weight newborn we need to test its significancy, since the dataset contains way more non smoking mothers.
The pearson coefficient shows a quite strong positive relation between weight and gestational duration as we expected.
We need to test if two categorical variables are associated, we will use the chi-square test. H0: Tipo.parto and Ospedale are independent. H1: They are not independent.
# Build contingency table
parto_hosp_table = table(data$Tipo.parto, data$Ospedale)
ggpubr::ggballoonplot(data=as.data.frame(parto_hosp_table),
fill="green4")
chisq.test(parto_hosp_table)
##
## Pearson's Chi-squared test
##
## data: parto_hosp_table
## X-squared = 1.0972, df = 2, p-value = 0.5778
From the baloon graph we can notice that there is not much difference across hospitals, the high p-value also confirms us that we do not refuse H0 and that the observed differences between the cesarean rates between hospital could easily be due to randomness.
We are going to test if the average Peso and Lunghezza of the dataset is consistent with population averages. To do that we will use a t.test comparing the mean of Peso and lunghezza to the mean value of population. We assume reference values of:
source :Stanford Children’s Health
source: Medicalnews article
# Birth weight
t.test(data$Peso, mu = 3200)
##
## One Sample t-test
##
## data: data$Peso
## t = 8.0071, df = 2499, p-value = 1.782e-15
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
## 3263.490 3304.672
## sample estimates:
## mean of x
## 3284.081
#evaluate threshold values with 5% significancy e n-1 elements
qt(0.025,2499)
## [1] -1.960914
-qt(0.025,2499)
## [1] 1.960914
# Length
t.test(data$Lunghezza, mu = 500)
##
## One Sample t-test
##
## data: data$Lunghezza
## t = -10.084, df = 2499, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
## 493.6598 495.7242
## sample estimates:
## mean of x
## 494.692
Given the results we refuse H0, since the p-value is below 0.05, the population mean falls outside the confidence interval and the t value of 8.01 is outside the range of threshold value (-1,96,1,96 for a two tailed t test with 2499 degrees of freedom and alfa = 0.05) Our sample mean is significantly higher than the population mean.
Given the results we refuse H0, since the p-value is below 0.05, the population mean falls outside the confidence interval and the t value of -10.084 is outside the range of threshold value (-1,96,1,96 for a two tailed t test with 2499 degrees of freedom and alfa = 0.05) Our sample mean is significantly lower than the population mean.
We are going to test if male and female newborns have significantly different means for: - Peso - Lunghezza - Cranio
We consider H0: mu_var_F = mu_var_M
# Compare anthropometric measures by sex
# Weight
t.test(Peso ~ Sesso, data = data)
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M
## 3161.132 3408.215
# Length
t.test(Lunghezza ~ Sesso, data = data)
##
## Welch Two Sample t-test
##
## data: Lunghezza by Sesso
## t = -9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -11.929470 -7.876273
## sample estimates:
## mean in group F mean in group M
## 489.7643 499.6672
# Cranial diameter
t.test(Cranio ~ Sesso, data = data)
##
## Welch Two Sample t-test
##
## data: Cranio by Sesso
## t = -7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -6.089912 -3.541270
## sample estimates:
## mean in group F mean in group M
## 337.6330 342.4486
In all cases, the p-values were far below 0.05 (nearly 0), and the confidence intervals did not include zero. So we reject H0 for all three variables, concluding that male and female newborns differ significantly in weight, length, and cranial diameter. Males consistently exhibit higher mean values across all measures.
We are going to start with a model that includes all the variables, explore all the coefficient and only then choose the best model.
n = nrow(data)
plot(density(data$Peso))
#Form index
moments::skewness(data$Peso)
## [1] -0.6470308
moments::kurtosis(data$Peso)-3
## [1] 2.031532
#SHAPIRO TEST
shapiro.test(data$Peso)
##
## Shapiro-Wilk normality test
##
## data: data$Peso
## W = 0.97066, p-value < 2.2e-16
#Correlation Matrix
numeric_matrix = data[, sapply(data, is.numeric)]
correlation_matrix = cor(numeric_matrix)
round(correlation_matrix,2)
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza
## Anni.madre 1.00 0.38 0.01 -0.14 -0.02 -0.06
## N.gravidanze 0.38 1.00 0.05 -0.10 0.00 -0.06
## Fumatrici 0.01 0.05 1.00 0.03 -0.02 -0.02
## Gestazione -0.14 -0.10 0.03 1.00 0.59 0.62
## Peso -0.02 0.00 -0.02 0.59 1.00 0.80
## Lunghezza -0.06 -0.06 -0.02 0.62 0.80 1.00
## Cranio 0.02 0.04 -0.01 0.46 0.70 0.60
## Cranio
## Anni.madre 0.02
## N.gravidanze 0.04
## Fumatrici -0.01
## Gestazione 0.46
## Peso 0.70
## Lunghezza 0.60
## Cranio 1.00
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(numeric_matrix,upper.panel = panel.smooth,lower.panel = panel.cor)
Before modeling, we examined the response variable Peso. The
distribution is slightly left-skewed (skewness = –0.65) and peaked
(kurtosis near 2), but acceptable for linear regression given the large
sample size (n = 2500).
The Shapiro-Wilk test was significant (p-value near 0), but this is
expected with large datasets.
We also computed the correlation matrix. Lunghezza,Cranio and
Gestazione show strong correlations with Peso (r = 0.80, 0.70, 0.59
respectively) and moderate correlation with each other, suggesting some
possible multicollinearity.
All predictors will be included in the initial model and evaluated
further.
# Build full linear regression model
mod1 = lm(Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso,
data = data)
summary(mod1)
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1124.40 -181.66 -14.42 160.91 2611.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6738.4762 141.3087 -47.686 < 2e-16 ***
## Anni.madre 0.8921 1.1323 0.788 0.4308
## N.gravidanze 11.2665 4.6608 2.417 0.0157 *
## Fumatrici -30.1631 27.5386 -1.095 0.2735
## Gestazione 32.5696 3.8187 8.529 < 2e-16 ***
## Lunghezza 10.2945 0.3007 34.236 < 2e-16 ***
## Cranio 10.4707 0.4260 24.578 < 2e-16 ***
## Tipo.partoNat 29.5254 12.0844 2.443 0.0146 *
## Ospedaleosp2 -11.2095 13.4379 -0.834 0.4043
## Ospedaleosp3 28.0958 13.4957 2.082 0.0375 *
## SessoM 77.5409 11.1776 6.937 5.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.9 on 2489 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 669.2 on 10 and 2489 DF, p-value: < 2.2e-16
From this model we can observe that the variables Anni.madre e Fumatrici are not significant, so before doing studying other insights we are going to build a second model without them.
mod2= lm(Peso ~ N.gravidanze + Gestazione +
Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso,
data = data)
summary(mod2)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1113.18 -181.16 -16.58 161.01 2620.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6707.4293 135.9438 -49.340 < 2e-16 ***
## N.gravidanze 12.3619 4.3325 2.853 0.00436 **
## Gestazione 31.9909 3.7896 8.442 < 2e-16 ***
## Lunghezza 10.3086 0.3004 34.316 < 2e-16 ***
## Cranio 10.4922 0.4254 24.661 < 2e-16 ***
## Tipo.partoNat 29.2803 12.0817 2.424 0.01544 *
## Ospedaleosp2 -11.0227 13.4363 -0.820 0.41209
## Ospedaleosp3 28.6408 13.4886 2.123 0.03382 *
## SessoM 77.4412 11.1756 6.930 5.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.9 on 2491 degrees of freedom
## Multiple R-squared: 0.7287, Adjusted R-squared: 0.7278
## F-statistic: 836.3 on 8 and 2491 DF, p-value: < 2.2e-16
In Model 2, we removed the non-significant variables Anni.madre and
Fumatrici from the full model.
The model has good explanatory power (Adjusted R² = 0.728) and all
included predictors remain interpretable and statistically valid.
Before using this model we want to test if removing less significant variables changes the perfomances of the model, so we are going to generate a third model without N. gravidanza and Ospedale variables.
mod3=lm(Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + Sesso,
data = data)
summary(mod3)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1118.43 -185.56 -16.07 161.53 2626.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6675.8084 135.7769 -49.167 < 2e-16 ***
## Gestazione 31.1917 3.7821 8.247 2.59e-16 ***
## Lunghezza 10.2412 0.3008 34.049 < 2e-16 ***
## Cranio 10.6397 0.4242 25.080 < 2e-16 ***
## Tipo.partoNat 29.1062 12.1113 2.403 0.0163 *
## SessoM 79.0670 11.2010 7.059 2.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2494 degrees of freedom
## Multiple R-squared: 0.7267, Adjusted R-squared: 0.7262
## F-statistic: 1326 on 5 and 2494 DF, p-value: < 2.2e-16
#Use VIF function from car library to evaluate multicollinearity problems.
vif(mod3)
## Gestazione Lunghezza Cranio Tipo.parto Sesso
## 1.653637 2.074593 1.607582 1.002755 1.038816
In Model 3, we further removed N.gravidanze and Ospedale, which
showed limited contribution.
This simplified model still performs comparably (Adjusted R² = 0.726)
and includes only core clinical predictors of birth weight. We also
started to use the VIF test to evaluate multicollinearity, in this case
since all coefficent are below 5 we do not have that kind of
problem.
Before officializing this model we want to do a last test without tipo.parto variable, aiming to simplify the model a little more.
mod4=lm(Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
data = data)
summary(mod4)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.2 -184.3 -17.6 163.3 2627.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.1188 135.5172 -49.080 < 2e-16 ***
## Gestazione 31.2737 3.7856 8.261 2.31e-16 ***
## Lunghezza 10.2054 0.3007 33.939 < 2e-16 ***
## Cranio 10.6704 0.4245 25.139 < 2e-16 ***
## SessoM 79.1049 11.2117 7.056 2.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2495 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1654 on 4 and 2495 DF, p-value: < 2.2e-16
#Use VIF function from car library to evaluate multicollinearity problems.
vif(mod4)
## Gestazione Lunghezza Cranio Sesso
## 1.653502 2.069517 1.606131 1.038813
In the final model, we excluded Tipo.parto to further simplify the
structure.
The model includes only four core clinical predictors (Gestazione,
Lunghezza, Cranio, and Sesso), all of which are highly significant.
Model performance remains excellent (Adjusted R² = 0.7257, Residual
SE ≈ 275 g), and no multicollinearity issues are present (VIF
coefficient < 2.1).
This version balances predictive accuracy and parsimony, and is selected
as the final model.
In addiction We observed a non-linear relationship between maternal
age and number of pregnancies, resembling a logarithmic pattern.
However, as neither variable was a significant predictor of birth
weight, this interaction was not included in the final model.
As a last step we are going to use the BIC test to evaluate the models.
BIC(mod1,mod2,mod3,mod4)
## df BIC
## mod1 12 35241.84
## mod2 10 35228.03
## mod3 7 35222.59
## mod4 6 35220.54
To support model selection, we compared BIC values across all candidate models. The final model (mod4) had the lowest BIC (630.62), confirming it as the most efficient and parsimonious solution.
# Predict values from the final model
predicted= predict(mod4)
# Calculate residuals
residuals_value = data$Peso - predicted
# Calculate RMSE
rmse = sqrt(mean(residuals_value^2))
rmse
## [1] 274.728
The final regression model achieves an Adjusted R² of 0.7257, indicating that approximately 73% of the variance in birth weight is explained by the predictors.
The Root Mean Squared Error (RMSE) is 275 grams, the model’s predictions deviate from actual birth weights by about 275 grams.
First we take a look to give a graphic interpretation of the results.
par(mfrow=c(2,2))
plot(mod4)
Residual values are distributed around a mean of zero without showing any clear pattern — confirming the assumption of linearity. There are some outliers around 2000 g, like obs 1551.
The residuals follow the diagonal in the QQ-plot, indicating that they are approximately normally distributed. Also here we can see obs 1551 as an outlier.
Residual variance looks stable
Most points have low leveragem, here obs 1551 stands out (high residual + leverage, it is also near the cook’s threshold line).
shapiro.test(residuals(mod4))
##
## Shapiro-Wilk normality test
##
## data: residuals(mod4)
## W = 0.9742, p-value < 2.2e-16
lmtest::bptest(mod4)
##
## studentized Breusch-Pagan test
##
## data: mod4
## BP = 89.148, df = 4, p-value < 2.2e-16
lmtest::dwtest(mod4)
##
## Durbin-Watson test
##
## data: mod4
## DW = 1.9557, p-value = 0.1337
## alternative hypothesis: true autocorrelation is greater than 0
We applied three diagnostic tests to evaluate model assumptions:
ggplot(data, aes(x = Gestazione, y = Peso, color = Sesso)) +
geom_point(alpha = 0.5,
position="jitter")+
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Birth Weight vs Gestational Weeks by Sex",
x = "Gestational Weeks",
y = "Birth Weight (g)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Plot predicted weight vs Cranio
ggplot(data, aes(x = Cranio, y = predicted))+
geom_point(alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
labs(title = "Predicted Birth Weight by Cranial Diameter",
x = "Cranial Diameter (mm)",
y = "Predicted Birth Weight (g)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
new_baby=data.frame(
Gestazione = 39,
Lunghezza = 500,
Cranio = 340,
Sesso = "F" )
predict(mod4, newdata = new_baby)
## 1
## 3299.19
To demonstrate the use of the model, we estimate the birth weight of a female baby with the following profile:
Using the final regression model, the predicted birth weight is approximately 3299 grams.
Comments on the Visualizations:
This plot shows a clear upward trend in birth weight as gestational weeks increase. The two regression lines (male and female) are nearly parallel, with males consistently above females. This confirms what the model detected: gestational age and sex both significantly influence birth weight.
This plot illustrates a strong positive linear relationship between cranial diameter and predicted birth weight. The trend is consistent and shows no major deviation, confirming the linearity assumption. It visually validates that cranial diameter is one of the strongest predictors in the final model
This project developed and validated a statistical model to predict neonatal birth weight using clinical variables. After testing multiple hypotheses and evaluating different regression models, the final model included:
Gestational Weeks
Neonatal Length
Cranial Diameter
Sex
The model explains approximately 73% of the variance in birth weight (Adjusted R² = 0.726) with a low RMSE of 275 grams, indicating strong predictive performance.
Diagnostic tests and residual plots confirmed that model assumptions were reasonably satisfied, with no critical outliers or violations.