library(ggpubr)

## Caricamento del pacchetto richiesto: ggplot2

library(moments)
library(car)

## Caricamento del pacchetto richiesto: carData

library(lmtest)

## Caricamento del pacchetto richiesto: zoo

## 
## Caricamento pacchetto: 'zoo'

## I seguenti oggetti sono mascherati da 'package:base':
## 
##     as.Date, as.Date.numeric

This project aims to build a statistical model to predict newborn birth weight, based on different clinical and demographic variables. Having this information at the time of birth represent a great opportunity to improve clinical planning and to reduce risks associated with preterm birth or low birth weight infants.

Step 1 - Dataset importation and statistics description

#Dataset importing
data = read.csv ("neonati.csv",stringsAsFactors=TRUE)
#Visualizing variables type and summarizing statistics values
str(data)

## 'data.frame':    2500 obs. of  10 variables:
##  $ Anni.madre  : int  26 21 34 28 20 32 26 25 22 23 ...
##  $ N.gravidanze: int  0 2 3 1 0 0 1 0 1 0 ...
##  $ Fumatrici   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gestazione  : int  42 39 38 41 38 40 39 40 40 41 ...
##  $ Peso        : int  3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
##  $ Lunghezza   : int  490 490 500 515 480 495 480 510 500 510 ...
##  $ Cranio      : int  325 345 375 365 335 340 345 349 335 362 ...
##  $ Tipo.parto  : Factor w/ 2 levels "Ces","Nat": 2 2 2 2 2 2 2 2 1 1 ...
##  $ Ospedale    : Factor w/ 3 levels "osp1","osp2",..: 3 1 2 2 3 2 3 1 2 2 ...
##  $ Sesso       : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...

summary(data)

##    Anni.madre     N.gravidanze       Fumatrici        Gestazione   
##  Min.   : 0.00   Min.   : 0.0000   Min.   :0.0000   Min.   :25.00  
##  1st Qu.:25.00   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:38.00  
##  Median :28.00   Median : 1.0000   Median :0.0000   Median :39.00  
##  Mean   :28.16   Mean   : 0.9812   Mean   :0.0416   Mean   :38.98  
##  3rd Qu.:32.00   3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.:40.00  
##  Max.   :46.00   Max.   :12.0000   Max.   :1.0000   Max.   :43.00  
##       Peso        Lunghezza         Cranio    Tipo.parto Ospedale   Sesso   
##  Min.   : 830   Min.   :310.0   Min.   :235   Ces: 728   osp1:816   F:1256  
##  1st Qu.:2990   1st Qu.:480.0   1st Qu.:330   Nat:1772   osp2:849   M:1244  
##  Median :3300   Median :500.0   Median :340              osp3:835           
##  Mean   :3284   Mean   :494.7   Mean   :340                                 
##  3rd Qu.:3620   3rd Qu.:510.0   3rd Qu.:350                                 
##  Max.   :4930   Max.   :565.0   Max.   :390

We start by importing the dataset, checking if all the variables have the expected type and visualize a brief summary of the data. During this step we also converted string variables to factors. From the summary() output, we observe that variables such as Ospedale, Sesso, and Tipo_Parto are correctly recognized as factors, each with a limited number of distinct levels. Now we explore the dataset using descriptive statistics.

# We start by creating a vector containing only numeric variables in data
numeric_var = sapply(data, is.numeric)
summary(data[,numeric_var])

##    Anni.madre     N.gravidanze       Fumatrici        Gestazione   
##  Min.   : 0.00   Min.   : 0.0000   Min.   :0.0000   Min.   :25.00  
##  1st Qu.:25.00   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:38.00  
##  Median :28.00   Median : 1.0000   Median :0.0000   Median :39.00  
##  Mean   :28.16   Mean   : 0.9812   Mean   :0.0416   Mean   :38.98  
##  3rd Qu.:32.00   3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.:40.00  
##  Max.   :46.00   Max.   :12.0000   Max.   :1.0000   Max.   :43.00  
##       Peso        Lunghezza         Cranio   
##  Min.   : 830   Min.   :310.0   Min.   :235  
##  1st Qu.:2990   1st Qu.:480.0   1st Qu.:330  
##  Median :3300   Median :500.0   Median :340  
##  Mean   :3284   Mean   :494.7   Mean   :340  
##  3rd Qu.:3620   3rd Qu.:510.0   3rd Qu.:350  
##  Max.   :4930   Max.   :565.0   Max.   :390

# We analyze mean and standard deviation for each numeric variable
sapply(data[,numeric_var],mean)

##   Anni.madre N.gravidanze    Fumatrici   Gestazione         Peso    Lunghezza 
##      28.1640       0.9812       0.0416      38.9804    3284.0808     494.6920 
##       Cranio 
##     340.0292

sapply(data[,numeric_var],sd)

##   Anni.madre N.gravidanze    Fumatrici   Gestazione         Peso    Lunghezza 
##    5.2735783    1.2805868    0.1997133    1.8686392  525.0387442   26.3186437 
##       Cranio 
##   16.4253299

#Along with numeric values we will explore variancy with boxplots
categorical_variables = c("Sesso","Tipo.parto","Ospedale")
for (var in categorical_variables){
  formula = reformulate(var,response ="Peso")
  
  #Create boxplot
  boxplot(formula,
          data=data,
          col="lightblue4",
          main=paste("Birth weight by",var),
          xlab=var,
          ylab="Birth weight [grams]")
}

From this first look the dataset seems balanced, we did not found any clear anomolies yet or extreme variance. Before moving on with hypothesis testing we will give a rapid look at the connection between Peso and the two variables that are suggested as influent, Fumatrici and Gestazione.

boxplot(Peso ~ Fumatrici, data = data,
        main = "Birth Weight by Maternal Smoking",
        col = "red4", 
        xlab = "Smoking (0 = No, 1 = Yes)", 
        ylab = "Birth Weight [grams]")

#Pearson correlation coefficient 
cor(data$Peso,data$Gestazione)

## [1] 0.5917687

As highlighted in the project text we explore the influence of maternal smoking and gestational duration on birth weight. From the boxplot we can observe that infants born to smoking mother tends to have a slight lower median weight and less variability. To confirm that the variable “Fumo” might help in predcting low weight newborn we need to test its significancy, since the dataset contains way more non smoking mothers.

The pearson coefficient shows a quite strong positive relation between weight and gestational duration as we expected.

Step 2 - Hyphotesis testing

Hypothesis 1 - Cesarean rate vary across hospital

We need to test if two categorical variables are associated, we will use the chi-square test. H0: Tipo.parto and Ospedale are independent. H1: They are not independent.

# Build contingency table
parto_hosp_table = table(data$Tipo.parto, data$Ospedale)

ggpubr::ggballoonplot(data=as.data.frame(parto_hosp_table),
                      fill="green4")

chisq.test(parto_hosp_table)

## 
##  Pearson's Chi-squared test
## 
## data:  parto_hosp_table
## X-squared = 1.0972, df = 2, p-value = 0.5778

From the baloon graph we can notice that there is not much difference across hospitals, the high p-value also confirms us that we do not refuse H0 and that the observed differences between the cesarean rates between hospital could easily be due to randomness.

Hypothesis 2 - Sample means = Population means

We are going to test if the average Peso and Lunghezza of the dataset is consistent with population averages. To do that we will use a t.test comparing the mean of Peso and lunghezza to the mean value of population. We assume reference values of:

Birth weight: 3200 grams

source :Stanford Children’s Health

Length: 500 millimeters

source: Medicalnews article

# Birth weight
t.test(data$Peso, mu = 3200)

## 
##  One Sample t-test
## 
## data:  data$Peso
## t = 8.0071, df = 2499, p-value = 1.782e-15
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
##  3263.490 3304.672
## sample estimates:
## mean of x 
##  3284.081

#evaluate threshold values with  5% significancy e n-1 elements
qt(0.025,2499)

## [1] -1.960914

-qt(0.025,2499)

## [1] 1.960914

# Length
t.test(data$Lunghezza, mu = 500)

## 
##  One Sample t-test
## 
## data:  data$Lunghezza
## t = -10.084, df = 2499, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
##  493.6598 495.7242
## sample estimates:
## mean of x 
##   494.692

Peso

Given the results we refuse H0, since the p-value is below 0.05, the population mean falls outside the confidence interval and the t value of 8.01 is outside the range of threshold value (-1,96,1,96 for a two tailed t test with 2499 degrees of freedom and alfa = 0.05) Our sample mean is significantly higher than the population mean.

Lunghezza

Given the results we refuse H0, since the p-value is below 0.05, the population mean falls outside the confidence interval and the t value of -10.084 is outside the range of threshold value (-1,96,1,96 for a two tailed t test with 2499 degrees of freedom and alfa = 0.05) Our sample mean is significantly lower than the population mean.

Hypothesis 3 - Anthropometric measures by Sex

We are going to test if male and female newborns have significantly different means for: - Peso - Lunghezza - Cranio

We consider H0: mu_var_F = mu_var_M

# Compare anthropometric measures by sex

# Weight
t.test(Peso ~ Sesso, data = data)

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M 
##        3161.132        3408.215

# Length
t.test(Lunghezza ~ Sesso, data = data)

## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.929470  -7.876273
## sample estimates:
## mean in group F mean in group M 
##        489.7643        499.6672

# Cranial diameter
t.test(Cranio ~ Sesso, data = data)

## 
##  Welch Two Sample t-test
## 
## data:  Cranio by Sesso
## t = -7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -6.089912 -3.541270
## sample estimates:
## mean in group F mean in group M 
##        337.6330        342.4486

In all cases, the p-values were far below 0.05 (nearly 0), and the confidence intervals did not include zero. So we reject H0 for all three variables, concluding that male and female newborns differ significantly in weight, length, and cranial diameter. Males consistently exhibit higher mean values across all measures.

Step 3 - Creating a regression model

We are going to start with a model that includes all the variables, explore all the coefficient and only then choose the best model.

n = nrow(data)
plot(density(data$Peso))

#Form index
moments::skewness(data$Peso)

## [1] -0.6470308

moments::kurtosis(data$Peso)-3

## [1] 2.031532

#SHAPIRO TEST
shapiro.test(data$Peso)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Peso
## W = 0.97066, p-value < 2.2e-16

#Correlation Matrix
numeric_matrix = data[, sapply(data, is.numeric)]
correlation_matrix = cor(numeric_matrix)
round(correlation_matrix,2)

##              Anni.madre N.gravidanze Fumatrici Gestazione  Peso Lunghezza
## Anni.madre         1.00         0.38      0.01      -0.14 -0.02     -0.06
## N.gravidanze       0.38         1.00      0.05      -0.10  0.00     -0.06
## Fumatrici          0.01         0.05      1.00       0.03 -0.02     -0.02
## Gestazione        -0.14        -0.10      0.03       1.00  0.59      0.62
## Peso              -0.02         0.00     -0.02       0.59  1.00      0.80
## Lunghezza         -0.06        -0.06     -0.02       0.62  0.80      1.00
## Cranio             0.02         0.04     -0.01       0.46  0.70      0.60
##              Cranio
## Anni.madre     0.02
## N.gravidanze   0.04
## Fumatrici     -0.01
## Gestazione     0.46
## Peso           0.70
## Lunghezza      0.60
## Cranio         1.00

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(numeric_matrix,upper.panel = panel.smooth,lower.panel = panel.cor)

Before modeling, we examined the response variable Peso. The distribution is slightly left-skewed (skewness = –0.65) and peaked (kurtosis near 2), but acceptable for linear regression given the large sample size (n = 2500).
The Shapiro-Wilk test was significant (p-value near 0), but this is expected with large datasets.

We also computed the correlation matrix. Lunghezza,Cranio and Gestazione show strong correlations with Peso (r = 0.80, 0.70, 0.59 respectively) and moderate correlation with each other, suggesting some possible multicollinearity.
All predictors will be included in the initial model and evaluated further.

# Build full linear regression model
mod1 = lm(Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
                 Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso,
          data = data)

summary(mod1)

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1124.40  -181.66   -14.42   160.91  2611.89 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6738.4762   141.3087 -47.686  < 2e-16 ***
## Anni.madre        0.8921     1.1323   0.788   0.4308    
## N.gravidanze     11.2665     4.6608   2.417   0.0157 *  
## Fumatrici       -30.1631    27.5386  -1.095   0.2735    
## Gestazione       32.5696     3.8187   8.529  < 2e-16 ***
## Lunghezza        10.2945     0.3007  34.236  < 2e-16 ***
## Cranio           10.4707     0.4260  24.578  < 2e-16 ***
## Tipo.partoNat    29.5254    12.0844   2.443   0.0146 *  
## Ospedaleosp2    -11.2095    13.4379  -0.834   0.4043    
## Ospedaleosp3     28.0958    13.4957   2.082   0.0375 *  
## SessoM           77.5409    11.1776   6.937 5.08e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.9 on 2489 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 669.2 on 10 and 2489 DF,  p-value: < 2.2e-16

From this model we can observe that the variables Anni.madre e Fumatrici are not significant, so before doing studying other insights we are going to build a second model without them.

mod2= lm(Peso ~ N.gravidanze + Gestazione +
                 Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso,
          data = data)
summary(mod2)

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.18  -181.16   -16.58   161.01  2620.19 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6707.4293   135.9438 -49.340  < 2e-16 ***
## N.gravidanze     12.3619     4.3325   2.853  0.00436 ** 
## Gestazione       31.9909     3.7896   8.442  < 2e-16 ***
## Lunghezza        10.3086     0.3004  34.316  < 2e-16 ***
## Cranio           10.4922     0.4254  24.661  < 2e-16 ***
## Tipo.partoNat    29.2803    12.0817   2.424  0.01544 *  
## Ospedaleosp2    -11.0227    13.4363  -0.820  0.41209    
## Ospedaleosp3     28.6408    13.4886   2.123  0.03382 *  
## SessoM           77.4412    11.1756   6.930 5.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.9 on 2491 degrees of freedom
## Multiple R-squared:  0.7287, Adjusted R-squared:  0.7278 
## F-statistic: 836.3 on 8 and 2491 DF,  p-value: < 2.2e-16

In Model 2, we removed the non-significant variables Anni.madre and Fumatrici from the full model.
The model has good explanatory power (Adjusted R² = 0.728) and all included predictors remain interpretable and statistically valid.

Before using this model we want to test if removing less significant variables changes the perfomances of the model, so we are going to generate a third model without N. gravidanza and Ospedale variables.

mod3=lm(Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + Sesso,
          data = data)
summary(mod3)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1118.43  -185.56   -16.07   161.53  2626.18 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6675.8084   135.7769 -49.167  < 2e-16 ***
## Gestazione       31.1917     3.7821   8.247 2.59e-16 ***
## Lunghezza        10.2412     0.3008  34.049  < 2e-16 ***
## Cranio           10.6397     0.4242  25.080  < 2e-16 ***
## Tipo.partoNat    29.1062    12.1113   2.403   0.0163 *  
## SessoM           79.0670    11.2010   7.059 2.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2494 degrees of freedom
## Multiple R-squared:  0.7267, Adjusted R-squared:  0.7262 
## F-statistic:  1326 on 5 and 2494 DF,  p-value: < 2.2e-16

#Use VIF function from car library to evaluate multicollinearity problems.
vif(mod3)

## Gestazione  Lunghezza     Cranio Tipo.parto      Sesso 
##   1.653637   2.074593   1.607582   1.002755   1.038816

In Model 3, we further removed N.gravidanze and Ospedale, which showed limited contribution.
This simplified model still performs comparably (Adjusted R² = 0.726) and includes only core clinical predictors of birth weight. We also started to use the VIF test to evaluate multicollinearity, in this case since all coefficent are below 5 we do not have that kind of problem.

Before officializing this model we want to do a last test without tipo.parto variable, aiming to simplify the model a little more.

mod4=lm(Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
          data = data)
summary(mod4)

## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.2  -184.3   -17.6   163.3  2627.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.1188   135.5172 -49.080  < 2e-16 ***
## Gestazione     31.2737     3.7856   8.261 2.31e-16 ***
## Lunghezza      10.2054     0.3007  33.939  < 2e-16 ***
## Cranio         10.6704     0.4245  25.139  < 2e-16 ***
## SessoM         79.1049    11.2117   7.056 2.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2495 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1654 on 4 and 2495 DF,  p-value: < 2.2e-16

#Use VIF function from car library to evaluate multicollinearity problems.
vif(mod4)

## Gestazione  Lunghezza     Cranio      Sesso 
##   1.653502   2.069517   1.606131   1.038813

In the final model, we excluded Tipo.parto to further simplify the structure.
The model includes only four core clinical predictors (Gestazione, Lunghezza, Cranio, and Sesso), all of which are highly significant.

Model performance remains excellent (Adjusted R² = 0.7257, Residual SE ≈ 275 g), and no multicollinearity issues are present (VIF coefficient < 2.1).
This version balances predictive accuracy and parsimony, and is selected as the final model.

In addiction We observed a non-linear relationship between maternal age and number of pregnancies, resembling a logarithmic pattern.
However, as neither variable was a significant predictor of birth weight, this interaction was not included in the final model.

As a last step we are going to use the BIC test to evaluate the models.

BIC(mod1,mod2,mod3,mod4)

##      df      BIC
## mod1 12 35241.84
## mod2 10 35228.03
## mod3  7 35222.59
## mod4  6 35220.54

To support model selection, we compared BIC values across all candidate models. The final model (mod4) had the lowest BIC (630.62), confirming it as the most efficient and parsimonious solution.

Step 4 - Analysing model’s quality

R² and Rmse

# Predict values from the final model
predicted= predict(mod4)

# Calculate residuals
residuals_value = data$Peso - predicted

# Calculate RMSE
rmse = sqrt(mean(residuals_value^2))
rmse

## [1] 274.728

The final regression model achieves an Adjusted R² of 0.7257, indicating that approximately 73% of the variance in birth weight is explained by the predictors.

The Root Mean Squared Error (RMSE) is 275 grams, the model’s predictions deviate from actual birth weights by about 275 grams.

Residual Analysis

First we take a look to give a graphic interpretation of the results.

par(mfrow=c(2,2))
plot(mod4)

Residual values are distributed around a mean of zero without showing any clear pattern — confirming the assumption of linearity. There are some outliers around 2000 g, like obs 1551.
The residuals follow the diagonal in the QQ-plot, indicating that they are approximately normally distributed. Also here we can see obs 1551 as an outlier.
Residual variance looks stable
Most points have low leveragem, here obs 1551 stands out (high residual + leverage, it is also near the cook’s threshold line).

shapiro.test(residuals(mod4))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod4)
## W = 0.9742, p-value < 2.2e-16

lmtest::bptest(mod4)

## 
##  studentized Breusch-Pagan test
## 
## data:  mod4
## BP = 89.148, df = 4, p-value < 2.2e-16

lmtest::dwtest(mod4)

## 
##  Durbin-Watson test
## 
## data:  mod4
## DW = 1.9557, p-value = 0.1337
## alternative hypothesis: true autocorrelation is greater than 0

We applied three diagnostic tests to evaluate model assumptions:

The Shapiro-Wilk test returned a significant result (p-value near 0) suggesting non-normal residuals. But given the large sample size and supportive QQ-plot, normality is considered acceptable.
The Breusch-Pagan test indicated some heteroscedasticity (p-value near 0, we refuse H0 of homoscedasticity), consistent with visual inspection. This may slightly impact standard errors but does not invalidate the model.
The Durbin-Watson test showed no evidence of autocorrelation in residuals (p-value of 0.13), confirming independence.

Step 5 - Conclusions, final visualization and comments

ggplot(data, aes(x = Gestazione, y = Peso, color = Sesso)) +
  geom_point(alpha = 0.5,
             position="jitter")+
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Birth Weight vs Gestational Weeks by Sex",
       x = "Gestational Weeks",
       y = "Birth Weight (g)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Plot predicted weight vs Cranio
ggplot(data, aes(x = Cranio, y = predicted))+
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "Predicted Birth Weight by Cranial Diameter",
       x = "Cranial Diameter (mm)",
       y = "Predicted Birth Weight (g)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

new_baby=data.frame(
  Gestazione = 39,
  Lunghezza = 500,
  Cranio = 340,
  Sesso = "F" )

predict(mod4, newdata = new_baby)

##       1 
## 3299.19

To demonstrate the use of the model, we estimate the birth weight of a female baby with the following profile:

Gestational Age: 39 weeks
Length: 500 mm
Cranial Diameter: 340 mm
Sex: Female

Using the final regression model, the predicted birth weight is approximately 3299 grams.

Comments on the Visualizations:

Birth Weight vs Gestational Weeks by Sex

This plot shows a clear upward trend in birth weight as gestational weeks increase. The two regression lines (male and female) are nearly parallel, with males consistently above females. This confirms what the model detected: gestational age and sex both significantly influence birth weight.

Predicted Birth Weight by Cranial Diameter

This plot illustrates a strong positive linear relationship between cranial diameter and predicted birth weight. The trend is consistent and shows no major deviation, confirming the linearity assumption. It visually validates that cranial diameter is one of the strongest predictors in the final model

This project developed and validated a statistical model to predict neonatal birth weight using clinical variables. After testing multiple hypotheses and evaluating different regression models, the final model included:

Gestational Weeks

Neonatal Length

Cranial Diameter

Sex

The model explains approximately 73% of the variance in birth weight (Adjusted R² = 0.726) with a low RMSE of 275 grams, indicating strong predictive performance.

Diagnostic tests and residual plots confirmed that model assumptions were reasonably satisfied, with no critical outliers or violations.

Statistical model for newborn weight prediction

Giulio Zerilli

2025-06-19