Installation of packages and import libraries

suppressPackageStartupMessages({
  library(zoo)
  library(dplyr)
  library(car)
})
library(sandwich) 
library(lmtest) 
library(ggplot2) 
library(knitr)

1. Data Collection and Dataset Structure

#Upload database 
dati<-read.csv("neonati.csv")

attach(dati)
print("summary dataset")
## [1] "summary dataset"
summary(dati)
##    Anni.madre     N.gravidanze       Fumatrici        Gestazione   
##  Min.   : 0.00   Min.   : 0.0000   Min.   :0.0000   Min.   :25.00  
##  1st Qu.:25.00   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:38.00  
##  Median :28.00   Median : 1.0000   Median :0.0000   Median :39.00  
##  Mean   :28.16   Mean   : 0.9812   Mean   :0.0416   Mean   :38.98  
##  3rd Qu.:32.00   3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.:40.00  
##  Max.   :46.00   Max.   :12.0000   Max.   :1.0000   Max.   :43.00  
##       Peso        Lunghezza         Cranio     Tipo.parto       
##  Min.   : 830   Min.   :310.0   Min.   :235   Length:2500       
##  1st Qu.:2990   1st Qu.:480.0   1st Qu.:330   Class :character  
##  Median :3300   Median :500.0   Median :340   Mode  :character  
##  Mean   :3284   Mean   :494.7   Mean   :340                     
##  3rd Qu.:3620   3rd Qu.:510.0   3rd Qu.:350                     
##  Max.   :4930   Max.   :565.0   Max.   :390                     
##    Ospedale            Sesso          
##  Length:2500        Length:2500       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
print("numbers of anomaly Anni.madre's variables less than 10 years: ")
## [1] "numbers of anomaly Anni.madre's variables less than 10 years: "
sum(dati$Anni.madre < 10)
## [1] 2
dati$Anni.madre[dati$Anni.madre < 10] <- NA
summary(dati)
##    Anni.madre     N.gravidanze       Fumatrici        Gestazione   
##  Min.   :13.00   Min.   : 0.0000   Min.   :0.0000   Min.   :25.00  
##  1st Qu.:25.00   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:38.00  
##  Median :28.00   Median : 1.0000   Median :0.0000   Median :39.00  
##  Mean   :28.19   Mean   : 0.9812   Mean   :0.0416   Mean   :38.98  
##  3rd Qu.:32.00   3rd Qu.: 1.0000   3rd Qu.:0.0000   3rd Qu.:40.00  
##  Max.   :46.00   Max.   :12.0000   Max.   :1.0000   Max.   :43.00  
##  NA's   :2                                                         
##       Peso        Lunghezza         Cranio     Tipo.parto       
##  Min.   : 830   Min.   :310.0   Min.   :235   Length:2500       
##  1st Qu.:2990   1st Qu.:480.0   1st Qu.:330   Class :character  
##  Median :3300   Median :500.0   Median :340   Mode  :character  
##  Mean   :3284   Mean   :494.7   Mean   :340                     
##  3rd Qu.:3620   3rd Qu.:510.0   3rd Qu.:350                     
##  Max.   :4930   Max.   :565.0   Max.   :390                     
##                                                                 
##    Ospedale            Sesso          
##  Length:2500        Length:2500       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
skew <- moments::skewness(Peso) 

kur <- moments::kurtosis(Peso)-3 

df <- data.frame(Skewness = skew, Kurtosis = kur)
kable(df)
Skewness Kurtosis
-0.6470308 2.031532
shapiro.test(Peso )
## 
##  Shapiro-Wilk normality test
## 
## data:  Peso
## W = 0.97066, p-value < 2.2e-16

During the initial exploratory analysis of the dataset, two outliers were identified in the ‘Mother’s age’ variable, both less than 10, which is an implausible value. To ensure data quality, these values were replaced with NA. It is chosen to replace outliers with NA instead of deleting them, in order to maintain the integrity of the dataset and allow transparent management of missing data during subsequent analyses.

The results of the analysis of the variable output peso are as follows:

1.2 Description of variables:

  • Anni.madre: Continuous quantitative variable on a ratio scale, indicating the mother’s age in years.

  • N.gravidanze: Discrete quantitative variable representing the number of previous pregnancies of the mother.

  • Fumatrici: Binary qualitative variable (0/1) indicating whether the mother is a smoker (1) or not (0). It is a nominal categorical variable with only two categories.

  • Gestazione: Continuous quantitative variable on a ratio scale, measures the duration of pregnancy in weeks, crucial for assessing the health of the newborn and prematurity.

  • Peso: Continuous quantitative variable on a ratio scale, represents the weight of the newborn at birth, an important indicator of neonatal health.

  • Lunghezza: Continuous quantitative variable, indicates the length of the newborn in centimetres, another parameter of physical development at birth.

  • Canio: Continuous quantitative variable, measures the newborn’s head circumference, an anthropometric indicator.

  • Tipo.parto: Nominal qualitative variable, represents the category of delivery type (e.g. natural, caesarean).

  • Ospedale: Nominal qualitative variable, indicates the hospital where the birth took place, a possible source of variability in the data.

  • Sex: Nominal qualitative variable, specifies the sex of the newborn (male/female), used as a control variable.

Relationship between the weight of newborns and variables of the dataset

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  plot.window(xlim = c(0,1), ylim = c(0,1))
  r <- cor(x, y, use = "complete.obs")
  txt <- format(c(r, 1), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) 
    cex.cor <- 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor)
}

pairs(dati[sapply(dati, is.numeric)], lower.panel=panel.cor, upper.panel=panel.smooth)

Moderate positive correlation between “Gestazione” e “Peso”. A scatterplot that is similar to a logarithmic function. This suggests that weight gain may be faster in the early stages (weeks of gestation) and then slow down.

Relationship with control variables

par(mfrow=c(1,2)) 
boxplot(Peso) 
boxplot(Peso~Sesso) 

boxplot(Peso~Fumatrici)

The presence of numerous outliers could indicate significant weight variability within the groups.

Verify whether the observed average difference is statistically significant.

t.test(Peso ~ Fumatrici, data = dati) 
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -45.61354 145.22674
## sample estimates:
## mean in group 0 mean in group 1 
##        3286.153        3236.346

There is a tendency towards a difference in the average weight of newborns between children of mothers who smoke and those who do not, with average values of 3236 g and 3286 g respectively. However, this difference was not statistically significant (t = 1.034, p = 0.3044, t-test), indicating that with the sample available, it is not possible to confirm a real difference. The t-test was considered appropriate given the large sample size, despite the non-normality of the Weight variable.

The test involving the other control variable (Sesso) is shown in Chapter 2.1.1 below.

2. Analysis and Modelling

2.1 Preliminary Analysis

2.1.1 Analysis of the relationship between ‘peso’ and variables (“Lungghezza”, “Peso” and “Cranio”).

par(mfrow=c(1,3))
boxplot(Peso~Sesso) 
boxplot(Lunghezza~Sesso) 
boxplot(Cranio~Sesso)

t.test(Peso ~ Sesso, data = dati) 
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M 
##        3161.132        3408.215
t.test(Lunghezza ~ Sesso, data = dati) 
## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.929470  -7.876273
## sample estimates:
## mean in group F mean in group M 
##        489.7643        499.6672
t.test(Cranio ~ Sesso, data = dati)
## 
##  Welch Two Sample t-test
## 
## data:  Cranio by Sesso
## t = -7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -6.089912 -3.541270
## sample estimates:
## mean in group F mean in group M 
##        337.6330        342.4486

The difference in the weight, length and dimensions of the skulls of males and females is statistically significant.

2.1.2 In some hospitals, more caesarean sections are performed.

H0: the proportion of caesarean sections is the same in all hospitals

tabella <- table(Ospedale, Tipo.parto) 
print(tabella)
##         Tipo.parto
## Ospedale Ces Nat
##     osp1 242 574
##     osp2 254 595
##     osp3 232 603
test_chi <- chisq.test(tabella) 

print(test_chi) 
## 
##  Pearson's Chi-squared test
## 
## data:  tabella
## X-squared = 1.0972, df = 2, p-value = 0.5778

There is insufficient statistical evidence to claim that the proportion of caesarean sections differs between the three hospitals in the sample.

2.1.3 The average weight and length of the sample of newborns are significantly equal to those of the population.

Mean weight population newborns from OMS:

mu_m_weight <- 3346.4 #g
mu_f_weight <- 3232.2 #g 
mu_m_length <- 498.84 #millimeters 
mu_f_length <- 491.48 #millimeters
dati_m <- subset(dati, Sesso == "M") 
dati_f <- subset(dati, Sesso == "F")

weight_male_newborns_sample <- mean(dati_m$Peso)
weight_female_newborns_sample <- mean(dati_f$Peso)

Normal test for variable ‘peso’ divided by variable ‘sesso’.

shapiro.test(dati_m$Peso)
## 
##  Shapiro-Wilk normality test
## 
## data:  dati_m$Peso
## W = 0.96647, p-value = 2.321e-16
shapiro.test(dati_f$Peso) 
## 
##  Shapiro-Wilk normality test
## 
## data:  dati_f$Peso
## W = 0.96285, p-value < 2.2e-16

The data do not follow a normal distribution. However, due to the central limit theorem, they can be used for a t-test.

H0: the median weight of males in the sample is equal to 3346.4 g

t.test(x = dati_m$Peso, mu = mu_m_weight) 
## 
##  One Sample t-test
## 
## data:  dati_m$Peso
## t = 4.4152, df = 1243, p-value = 1.097e-05
## alternative hypothesis: true mean is not equal to 3346.4
## 95 percent confidence interval:
##  3380.748 3435.683
## sample estimates:
## mean of x 
##  3408.215

p_value>0.05: I can’t refuse H0.

H0: the median weight of females in the sample is equal to 3232.2 g

t.test(x = dati_f$Peso, mu = mu_f_weight) 
## 
##  One Sample t-test
## 
## data:  dati_f$Peso
## t = -4.7855, df = 1255, p-value = 1.908e-06
## alternative hypothesis: true mean is not equal to 3232.2
## 95 percent confidence interval:
##  3131.997 3190.267
## sample estimates:
## mean of x 
##  3161.132

p_value>0.05: I can’t refuse H0.

H0: the median length of males in the sample is equal to 498.84 mm

t.test(x = dati_f$Lunghezza, mu = mu_m_length) 
## 
##  One Sample t-test
## 
## data:  dati_f$Lunghezza
## t = -11.682, df = 1255, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 498.84
## 95 percent confidence interval:
##  488.2401 491.2885
## sample estimates:
## mean of x 
##  489.7643

p_value>0.05: I can’t refuse H0.

H0: the median weight of females in the sample is equal to 491.48 mm

t.test(x = dati_f$Lunghezza, mu = mu_f_length) 
## 
##  One Sample t-test
## 
## data:  dati_f$Lunghezza
## t = -2.2083, df = 1255, p-value = 0.0274
## alternative hypothesis: true mean is not equal to 491.48
## 95 percent confidence interval:
##  488.2401 491.2885
## sample estimates:
## mean of x 
##  489.7643

p_value>0.05: I can’t refuse H0.

In light of the results obtained: the average weight and length of this sample of newborns are significantly equal to those of the population.

2.2 Creation of the Regression Model

Due to the presence of two NA values, a new dataset has been created that contains only complete observations of the variables to be used in the model. According to the literature and clinical considerations, both Tipo.parto and Ospedale are unlikely to have a significant influence on weight and may confound the model. It is important to apply the BIC test to evaluate the best model.

dati_complete <- dati[complete.cases(dati[, c("N.gravidanze", "Fumatrici", "Gestazione", "Lunghezza", "Cranio", "Sesso", "Anni.madre")]), ]
  1. Model with all potentially relevant variables.
mod<-lm(Peso ~ . - Tipo.parto - Ospedale, data=dati_complete) 

summary(mod)
## 
## Call:
## lm(formula = Peso ~ . - Tipo.parto - Ospedale, data = dati_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1160.6  -181.3   -15.7   163.6  2630.7 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6712.2405   141.3339 -47.492  < 2e-16 ***
## Anni.madre       0.8803     1.1491   0.766    0.444    
## N.gravidanze    11.3789     4.6767   2.433    0.015 *  
## Fumatrici      -30.3958    27.6080  -1.101    0.271    
## Gestazione      32.9472     3.8288   8.605  < 2e-16 ***
## Lunghezza       10.2316     0.3011  33.979  < 2e-16 ***
## Cranio          10.5198     0.4271  24.633  < 2e-16 ***
## SessoM          78.0787    11.2132   6.963 4.24e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7264 
## F-statistic: 948.3 on 7 and 2490 DF,  p-value: < 2.2e-16
#Test for multicollinearity using VIF.
vif(mod)
##   Anni.madre N.gravidanze    Fumatrici   Gestazione    Lunghezza       Cranio 
##     1.189264     1.187447     1.006692     1.694331     2.079749     1.628987 
##        Sesso 
##     1.040493

No variable shows a VIF factor greater than 5, so there is no risk of multicollinearity.

  1. Due to the high value of the Anni.madre variable, it has been decided not to consider it. Fumatrici remains in the model as a control variable despite the p-value > 0.05, for clinical or literature considerations.
mod1 <- update(mod, . ~ . - Anni.madre)

summary(mod1)
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Sesso, data = dati_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1150.24  -181.32   -15.73   162.95  2635.69 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6682.2637   135.7983 -49.207  < 2e-16 ***
## N.gravidanze    12.6996     4.3470   2.921  0.00352 ** 
## Fumatrici      -30.5728    27.6048  -1.108  0.26818    
## Gestazione      32.6437     3.8079   8.573  < 2e-16 ***
## Lunghezza       10.2309     0.3011  33.979  < 2e-16 ***
## Cranio          10.5366     0.4265  24.707  < 2e-16 ***
## SessoM          78.1596    11.2117   6.971 4.01e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2491 degrees of freedom
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7265 
## F-statistic:  1106 on 6 and 2491 DF,  p-value: < 2.2e-16

After removing the variable ‘Anni.madre’, which was not statistically significant, the model essentially retains the same explanatory power.

2.3 Selection of the Optimal Model

BIC(mod,mod1)
##      df      BIC
## mod   9 35207.48
## mod1  8 35200.24

The comparison between the models using the Bayesian information criterion (BIC) indicates that the model without the variable ‘Anni.madre’ (model mod1) is the best choice. In fact, mod1 has a lower BIC value (35200.24) than the complete model (mod), which includes ‘Anni.madre’ (BIC = 35207.48). This result suggests that excluding ‘Anni.madre’ allows for a more parsimonious model without significantly compromising the quality of the fit to the data. Consequently, model mod1 was selected as the final model for subsequent analyses.

Before moving on to the next step and select the final model, it would be interesting to consider the combination of different values:

  • Combination of Sesso and Fumatrici: verify whether the effect of maternal smoking varies according to the sex of the newborn

    mod2 <- update(mod1, . ~ . +Sesso:Fumatrici)
    summary(mod2)
    ## 
    ## Call:
    ## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
    ##     Cranio + Sesso + Fumatrici:Sesso, data = dati_complete)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1151.02  -182.19   -16.34   163.02  2636.68 
    ## 
    ## Coefficients:
    ##                    Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)      -6685.1982   135.8916 -49.195  < 2e-16 ***
    ## N.gravidanze        12.7188     4.3476   2.925  0.00347 ** 
    ## Fumatrici          -11.9447    40.0956  -0.298  0.76580    
    ## Gestazione          32.6686     3.8086   8.578  < 2e-16 ***
    ## Lunghezza           10.2326     0.3011  33.980  < 2e-16 ***
    ## Cranio              10.5378     0.4265  24.706  < 2e-16 ***
    ## SessoM              79.5898    11.4331   6.961 4.29e-12 ***
    ## Fumatrici:SessoM   -35.3270    55.1420  -0.641  0.52181    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 274.7 on 2490 degrees of freedom
    ## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7264 
    ## F-statistic: 948.2 on 7 and 2490 DF,  p-value: < 2.2e-16

    The analysis assessed whether the maternal effect of smoking (‘Fumatici’) on neonatal weight was modified by the sex of the newborn, including an interaction term between the two variables in the model. Despite the plausible clinical hypothesis, the results show that the interaction coefficient is not statistically significant (p = 0.52), indicating no evidence of a differential effect of smoking between males and females. Therefore, it is concluded that the inclusion of the ‘Fumatici:Sesso’ interaction does not make a significant contribution to explaining the variability in neonatal weight in the present dataset.

  • Combination of Anni.madre and Gestazione: verify the impact of gestation length on newborn weight may vary depending on the mother’s age.

    mod3 <- update(mod1, . ~ . + Anni.madre:Gestazione)
    summary(mod3)
    ## 
    ## Call:
    ## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
    ##     Cranio + Sesso + Gestazione:Anni.madre, data = dati_complete)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1162.30  -181.83   -15.07   163.94  2629.70 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           -6.687e+03  1.359e+02 -49.201  < 2e-16 ***
    ## N.gravidanze           1.113e+01  4.676e+00   2.380   0.0174 *  
    ## Fumatrici             -3.036e+01  2.761e+01  -1.100   0.2715    
    ## Gestazione             3.223e+01  3.835e+00   8.402  < 2e-16 ***
    ## Lunghezza              1.023e+01  3.011e-01  33.979  < 2e-16 ***
    ## Cranio                 1.052e+01  4.270e-01  24.626  < 2e-16 ***
    ## SessoM                 7.807e+01  1.121e+01   6.963 4.26e-12 ***
    ## Gestazione:Anni.madre  2.704e-02  2.957e-02   0.914   0.3607    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 274.7 on 2490 degrees of freedom
    ## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7265 
    ## F-statistic: 948.4 on 7 and 2490 DF,  p-value: < 2.2e-16

    The analysis tested the interaction between gestation length and maternal age to assess whether the effect of gestation on neonatal weight varies according to the mother’s age. The interaction coefficient is not statistically significant (p = 0.36), suggesting that maternal age does not modify the association between gestation and weight. The model with interaction does not show a clear improvement in explanatory power compared to the model without interaction, as also evidenced by the similar R² values.

  • Combination of Fumatrici and Gestazione verify smoking could affect the duration of pregnancy and therefore the birth weight of the baby.

    mod4 <- update(mod1, . ~ . + Fumatrici:Gestazione)
    summary(mod4)
    ## 
    ## Call:
    ## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
    ##     Cranio + Sesso + Fumatrici:Gestazione, data = dati_complete)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1149.83  -181.58   -16.95   163.76  2635.68 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)          -6699.6947   136.7290 -49.000  < 2e-16 ***
    ## N.gravidanze            12.7452     4.3470   2.932   0.0034 ** 
    ## Fumatrici              795.7016   757.5315   1.050   0.2936    
    ## Gestazione              33.2007     3.8418   8.642  < 2e-16 ***
    ## Lunghezza               10.2252     0.3011  33.957  < 2e-16 ***
    ## Cranio                  10.5313     0.4265  24.693  < 2e-16 ***
    ## SessoM                  78.7436    11.2241   7.016 2.94e-12 ***
    ## Fumatrici:Gestazione   -21.0469    19.2830  -1.091   0.2752    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 274.7 on 2490 degrees of freedom
    ## Multiple R-squared:  0.7273, Adjusted R-squared:  0.7265 
    ## F-statistic: 948.6 on 7 and 2490 DF,  p-value: < 2.2e-16

    The analysis included an interaction between maternal smoking and gestation length to verify whether the effect of gestation on neonatal weight changes depending on the mother’s smoking status. Although negative, the interaction coefficient is not statistically significant (p = 0.275), indicating that the available data show no evidence of a modifying effect of smoking on the influence of gestation on newborn weight. Consequently, the model with this interaction does not show a substantial improvement over the model without it, as confirmed by the similar value of R². Therefore, the inclusion of this interaction term does not appear justified.

  • Combination of Fumatrici and Anni.madre: verify whether the impact of smoking on newborn weight varies according to maternal age.

    mod5 <- update(mod1, . ~ . + Fumatrici:Anni.madre)
    summary(mod5)
    ## 
    ## Call:
    ## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
    ##     Cranio + Sesso + Fumatrici:Anni.madre, data = dati_complete)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1150.21  -181.76   -15.66   162.96  2636.28 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)          -6682.5271   135.8277 -49.199  < 2e-16 ***
    ## N.gravidanze            12.6170     4.3598   2.894  0.00384 ** 
    ## Fumatrici              -66.7002   144.1633  -0.463  0.64364    
    ## Gestazione              32.6449     3.8087   8.571  < 2e-16 ***
    ## Lunghezza               10.2335     0.3013  33.961  < 2e-16 ***
    ## Cranio                  10.5336     0.4267  24.685  < 2e-16 ***
    ## SessoM                  78.1696    11.2139   6.971 4.02e-12 ***
    ## Fumatrici:Anni.madre     1.2768     5.0008   0.255  0.79849    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 274.7 on 2490 degrees of freedom
    ## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7264 
    ## F-statistic:   948 on 7 and 2490 DF,  p-value: < 2.2e-16

    The interaction between “Smokers” and “Mother’s age” has a very small positive coefficient (1.28) and is largely insignificant (p = 0.798), indicating that there is no evidence that the effect of smoking on neonatal weight changes based on the mother’s age in the data analysed. Furthermore, the ‘Smokers’ coefficient alone is not significant, while the other coefficients remain consistent with the previous results.

BIC(mod,mod1,mod2,mod3,mod4,mod5)
##      df      BIC
## mod   9 35207.48
## mod1  8 35200.24
## mod2  9 35207.65
## mod3  9 35207.23
## mod4  9 35206.87
## mod5  9 35208.00

To select the most suitable model, the Bayesian information criterion (BIC) was used.

Several models were compared, including:

  • mod1, basic model without interaction terms (8 parameters, BIC = 35200.24);

  • models with different additional interactions (9 parameters each, BIC between 35206.87 and 35208.00).

The model without interactions (mod1) has the lowest BIC value, indicating a better balance between goodness of fit and model parsimony.

The difference in BIC between mod1 and the other models is greater than 6 points, suggesting substantial evidence in favour of the simpler model.

Furthermore, the added interaction terms are not statistically significant, confirming that their inclusion does not consistently improve the predictive power of the model.

Therefore, it was deemed appropriate to adopt the model without interactions as the final model for the analysis.

2.3.1 Comments on the best model

summary(mod1)
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Sesso, data = dati_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1150.24  -181.32   -15.73   162.95  2635.69 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6682.2637   135.7983 -49.207  < 2e-16 ***
## N.gravidanze    12.6996     4.3470   2.921  0.00352 ** 
## Fumatrici      -30.5728    27.6048  -1.108  0.26818    
## Gestazione      32.6437     3.8079   8.573  < 2e-16 ***
## Lunghezza       10.2309     0.3011  33.979  < 2e-16 ***
## Cranio          10.5366     0.4265  24.707  < 2e-16 ***
## SessoM          78.1596    11.2117   6.971 4.01e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2491 degrees of freedom
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7265 
## F-statistic:  1106 on 6 and 2491 DF,  p-value: < 2.2e-16
  • The intercept (-6682.26) represents the estimated baseline Peso when all predictor variables are zero; however, this value may not have practical meaning if zero values are outside the observed data range.

  • For each additional pregnancy (N.gravidanze), the expected Peso increases on average by approximately 12.70 units, holding all other variables constant.

  • The coefficient for Fumatrici (-30.57) suggests a decrease in Peso for smokers, but this effect is not statistically significant (p = 0.268), so no firm conclusion can be drawn.

  • Each additional unit of Gestazione is associated with an average increase of about 32.64 units in Peso, controlling for other variables.

  • Lunghezza has a positive effect, with an average increase of 10.23 units in Peso per unit increase in Lunghezza.

  • Cranio is also positively associated, with an average Peso increase of 10.54 units for each unit increase in Cranio.

  • The variable SessoM indicates that males have on average 78.16 units higher Peso compared to females, holding other factors constant.

2.4 Model Quality Analysis

par(mfrow=c(2,2)) 
plot(mod1)

  1. It seems that the homoscedasticity assumption is being respected. Observations are distributed in a random way around the line located at value 0.

  2. Most of the points follow the diagonal line in the Q-Q plot, it means that the model residuals are approximately normal.

  3. Even with the Scale-Location, the model seems to respect the homoscedasticity assumption.

  4. Most of the residuals are randomly distributed around zero, suggesting a good overall fit.

lmtest::bptest(mod1) 
## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 89.841, df = 6, p-value < 2.2e-16

The Breusch-Pagan test reveals significant heteroscedasticity in the model residuals (BP = 89.841, p-value < 2.2e-16), indicating that the variance of the errors is not constant. This could compromise the efficiency of the estimators and the validity of the inferences based on the assumption of homoscedasticity.

lmtest::dwtest(mod1)
## 
##  Durbin-Watson test
## 
## data:  mod1
## DW = 1.9539, p-value = 0.1243
## alternative hypothesis: true autocorrelation is greater than 0

The Durbin-Watson test value (DW = 1.9539, p-value = 0.1243) does not detect significant positive autocorrelation in the residuals. Therefore, we can assume that the errors are independent of each other, at least with regard to first-order autocorrelation.

shapiro.test(mod1$residuals) 
## 
##  Shapiro-Wilk normality test
## 
## data:  mod1$residuals
## W = 0.97416, p-value < 2.2e-16
plot(density(residuals(mod1)))

The Shapiro-Wilk test rejects the hypothesis of normality of the residuals (W = 0.97416, p-value < 2.2e-16), suggesting that the distribution of errors deviates from normality. This may affect the reliability of inferences based on asymptotic and parametric properties.

2.5 Outlier study and considerations

#leverage
lev<-hatvalues(mod1) 
plot(lev)
p<-sum(lev)
n<-length(lev)

soglia=2*p/n

abline(h=soglia,col=2)

lev[lev>soglia]
##          13          15          34          67          89          99 
## 0.005703347 0.007069519 0.006759391 0.005897170 0.012912032 0.010439560 
##         101         105         106         120         128         131 
## 0.007549323 0.010620656 0.014509942 0.010039207 0.011385756 0.007243263 
##         134         140         151         155         161         182 
## 0.007594636 0.011384225 0.010960845 0.007238247 0.020454182 0.011315127 
##         194         204         206         220         234         242 
## 0.010838669 0.014580285 0.009489217 0.007449878 0.010852452 0.010236626 
##         251         279         294         296         306         310 
## 0.010893968 0.010509906 0.005976412 0.010171622 0.010858365 0.028816077 
##         312         321         335         378         391         413 
## 0.013208271 0.010712780 0.010921940 0.015942776 0.010946451 0.010551081 
##         424         442         445         473         492         516 
## 0.010779018 0.016102594 0.007514853 0.011311323 0.008307115 0.013189756 
##         538         557         567         572         582         587 
## 0.012114779 0.010675469 0.010349074 0.010613068 0.011722487 0.008415937 
##         592         593         638         656         658         668 
## 0.006385650 0.010424563 0.006700518 0.006025333 0.011305424 0.011516875 
##         684         697         699         703         748         750 
## 0.008843220 0.005873668 0.011104540 0.010766247 0.008582403 0.006967238 
##         757         758         765         805         828         913 
## 0.008194171 0.011589585 0.006081784 0.014371442 0.007252546 0.005646755 
##         928         932         946         947         956         984 
## 0.022759679 0.010473049 0.006931361 0.008436113 0.007861539 0.010405359 
##         985        1014        1017        1026        1037        1051 
## 0.007132846 0.008524032 0.011237743 0.011627834 0.010358071 0.010766679 
##        1067        1091        1106        1110        1118        1130 
## 0.008474975 0.008958973 0.006006367 0.010413856 0.010364104 0.032015721 
##        1170        1175        1181        1188        1219        1227 
## 0.010797183 0.010504729 0.005679516 0.006486375 0.030880280 0.011904117 
##        1238        1248        1262        1271        1273        1282 
## 0.005914871 0.014645989 0.012913934 0.010119833 0.007089555 0.010435158 
##        1285        1291        1293        1311        1321        1326 
## 0.012202207 0.006161061 0.006100990 0.009678935 0.009356101 0.011063731 
##        1333        1357        1368        1379        1385        1397 
## 0.011374270 0.006967713 0.011083301 0.010729178 0.012642335 0.011244147 
##        1398        1400        1410        1411        1415        1425 
## 0.010899348 0.005932589 0.012148484 0.008129552 0.010395643 0.010298620 
##        1426        1428        1429        1443        1449        1450 
## 0.013001527 0.008248631 0.021765511 0.011273327 0.011023383 0.015266009 
##        1458        1473        1480        1505        1512        1525 
## 0.010509906 0.010705195 0.011506319 0.013430500 0.011239860 0.010430013 
##        1537        1551        1553        1556        1576        1583 
## 0.011945912 0.048928246 0.008509382 0.005962153 0.010628946 0.012627911 
##        1593        1610        1619        1626        1652        1660 
## 0.005670938 0.008730820 0.015089828 0.011105265 0.011302561 0.011289996 
##        1672        1686        1691        1701        1712        1718 
## 0.010904312 0.009356741 0.010807668 0.010860519 0.006999602 0.007041085 
##        1720        1727        1761        1763        1780        1781 
## 0.010997587 0.013379628 0.011312431 0.010749679 0.025549378 0.016924209 
##        1789        1809        1827        1902        1906        1920 
## 0.010798211 0.008713628 0.006080039 0.010577043 0.010376884 0.014345163 
##        1929        1933        1971        1977        2003        2016 
## 0.012565315 0.010997544 0.012329394 0.006935578 0.011147894 0.013534068 
##        2040        2046        2049        2086        2089        2101 
## 0.011551043 0.014287924 0.010439560 0.013305448 0.015640851 0.011514295 
##        2110        2114        2115        2120        2140        2145 
## 0.010609140 0.013334510 0.011783317 0.018667294 0.006263686 0.010268870 
##        2146        2148        2149        2157        2175        2200 
## 0.005836472 0.007986616 0.013612587 0.005970731 0.032600634 0.011679528 
##        2202        2216        2220        2221        2224        2237 
## 0.010368854 0.008122716 0.013757467 0.021759446 0.005850764 0.010698612 
##        2238        2244        2245        2256        2257        2270 
## 0.010965495 0.006995602 0.013623314 0.010583299 0.006186646 0.011003034 
##        2282        2285        2307        2317        2337        2353 
## 0.011000085 0.010708672 0.013986777 0.007753671 0.014207376 0.012975764 
##        2359        2361        2408        2412        2422        2437 
## 0.010107658 0.010627641 0.009714482 0.010413631 0.021701995 0.023964759 
##        2450        2452        2458        2459        2465        2471 
## 0.010627964 0.023848473 0.008509147 0.010213499 0.011318339 0.021049712 
##        2478 
## 0.005777114
#outliers
plot(rstudent(mod1))
abline(h=c(-2,2))

car::outlierTest(mod1)
##       rstudent unadjusted p-value Bonferroni p
## 1551 10.033980         2.9696e-23   7.4182e-20
## 155   5.019634         5.5428e-07   1.3846e-03
## 1306  4.820812         1.5158e-06   3.7865e-03
cook<-cooks.distance(mod1)
plot(cook,ylim = c(0,1)) 

excludi <- c(1551, 155, 1306)
data_filtrato <- dati_complete[-excludi, ] 
mod_filtrato <- update(mod1, data = data_filtrato) 
summary(mod_filtrato)
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Sesso, data = data_filtrato)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1148.85  -180.85   -14.38   163.68  2651.12 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6710.0504   135.0574 -49.683  < 2e-16 ***
## N.gravidanze    12.5941     4.3235   2.913  0.00361 ** 
## Fumatrici      -29.7504    27.4235  -1.085  0.27809    
## Gestazione      33.4626     3.7932   8.822  < 2e-16 ***
## Lunghezza       10.2909     0.3001  34.286  < 2e-16 ***
## Cranio          10.4362     0.4240  24.612  < 2e-16 ***
## SessoM          77.1850    11.1501   6.922 5.63e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 272.9 on 2488 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7297 
## F-statistic:  1123 on 6 and 2488 DF,  p-value: < 2.2e-16
dati_complete[excludi, ]  
##      Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio
## 1553         30            4         0         35 4520       520    360
## 155          30            0         0         36 3610       410    330
## 1307         29            0         0         42 3560       510    355
##      Tipo.parto Ospedale Sesso
## 1553        Nat     osp2     F
## 155         Nat     osp1     M
## 1307        Nat     osp1     F

The diagnostic analysis of the regression model was initiated by evaluating the leverage of the observations in order to identify those with a potentially excessive influence on the estimated parameters. Observations with leverage above the reference threshold were considered potentially problematic, as they could distort the model estimate.

Subsequently, the distribution of studentised residuals was examined, highlighting those exceeding the limits of ±2, identified as possible outliers in the response values. The application of a specific statistical test made it possible to identify observations with statistically significant residuals.

Cook’s distance was used to measure the overall influence of each observation on the model, integrating information on leverage and residuals. Observations with high values of this distance were recognised as particularly influential.

To assess the effect of influential observations on the validity of the model, the model was reconstructed excluding them. The comparison between the original model and the one obtained from the filtered dataset showed marginal improvements in quality metrics (standard residual and coefficient of determination R^2), while the estimated coefficients were substantially stable.

These results indicate good model robustness, while recognising the significant role of influential observations. Given the plausibility of the values of the possible outlier observations, it was decided to keep these data in the training model.

3. Forecasts and Results

new_case <- data.frame( 
  Gestazione = 39, 
  Lunghezza = mean(dati_f$Lunghezza),
  Cranio = mean(dati_f$Cranio), 
  Sesso = "F",
  Fumatrici = 0,
  Tipo.parto = "Nat",
  N.gravidanze = 3)

predicted_Peso <- predict(mod1,new_case)
new_case$Predicted_Peso <- predicted_Peso 
kable(new_case)
Gestazione Lunghezza Cranio Sesso Fumatrici Tipo.parto N.gravidanze Predicted_Peso
39 489.7643 337.633 F 0 Nat 3 3197.163

4. Views

#dataframe with the varabiles: "Gestazione" in range (37,42) and "Fumatrici" (0,1) 
data <- expand.grid(
  Gestazione = seq(37, 42, by = 1),
  Fumatrici = c(0, 1),
  Sesso = c("F", "M")
)

# Calculate the averages to be assigned
media_Lunghezza_F <- mean(dati_f$Lunghezza)
media_Cranio_F <- mean(dati_f$Cranio)

media_Lunghezza_M <- mean(dati_m$Lunghezza)
media_Cranio_M <- mean(dati_m$Cranio)

media_N_gravidanze <- mean(dati_complete$N.gravidanze)

# assign the variables Length and Skull based on Gender using ifelse
data <- data %>%
  mutate(
    Lunghezza = ifelse(Sesso == "F", media_Lunghezza_F, media_Lunghezza_M),
    Cranio = ifelse(Sesso == "F", media_Cranio_F, media_Cranio_M),
    N.gravidanze = media_N_gravidanze
  )

# Prediction calculation
data$Peso_predetto <- predict(mod1, newdata = data)

# show the table
kable(data)
Gestazione Fumatrici Sesso Lunghezza Cranio N.gravidanze Peso_predetto
37 0 F 489.7643 337.6330 0.9815853 3106.243
38 0 F 489.7643 337.6330 0.9815853 3138.886
39 0 F 489.7643 337.6330 0.9815853 3171.530
40 0 F 489.7643 337.6330 0.9815853 3204.174
41 0 F 489.7643 337.6330 0.9815853 3236.818
42 0 F 489.7643 337.6330 0.9815853 3269.461
37 1 F 489.7643 337.6330 0.9815853 3075.670
38 1 F 489.7643 337.6330 0.9815853 3108.314
39 1 F 489.7643 337.6330 0.9815853 3140.957
40 1 F 489.7643 337.6330 0.9815853 3173.601
41 1 F 489.7643 337.6330 0.9815853 3206.245
42 1 F 489.7643 337.6330 0.9815853 3238.889
37 0 M 499.6672 342.4486 0.9815853 3336.457
38 0 M 499.6672 342.4486 0.9815853 3369.101
39 0 M 499.6672 342.4486 0.9815853 3401.745
40 0 M 499.6672 342.4486 0.9815853 3434.388
41 0 M 499.6672 342.4486 0.9815853 3467.032
42 0 M 499.6672 342.4486 0.9815853 3499.676
37 1 M 499.6672 342.4486 0.9815853 3305.885
38 1 M 499.6672 342.4486 0.9815853 3338.528
39 1 M 499.6672 342.4486 0.9815853 3371.172
40 1 M 499.6672 342.4486 0.9815853 3403.816
41 1 M 499.6672 342.4486 0.9815853 3436.459
42 1 M 499.6672 342.4486 0.9815853 3469.103
# Line graph
data$Fumo_descr <- ifelse(data$Fumatrici == 0, "Non fumatrice", "Fumatrice")

data$Gruppo <- interaction(data$Sesso, data$Fumo_descr, sep = "_")
ggplot(data, aes(x = Gestazione, y = Peso_predetto, color = Gruppo)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Impatto combinato di Sesso e Fumo sul Peso Predetto",
    x = "Settimane di Gestazione",
    y = "Peso Predetto (grammi)",
    color = "Gruppo (Sesso_Fumo)"
  ) +
  theme_minimal()

The results show how the Weight variable is influenced by Gender and Smoking. The graphic representation consists of lines showing the predicted weight as a function of weeks of gestation.

It can be seen that the predicted weight for male newborns is always higher than that for females, whether the mother is a smoker or not.

As the variable ‘Settimane di Gestazione’ increases, the weight of newborns increases, whether the mother is a smoker or not.

Ultimately, smoking significantly impacts the weight of newborns. Newborns whose mothers smoke tend to be lighter, regardless of gender.

Dataset and Model Limitations:

  1. The dataset used is limited in size, which may reduce the statistical power of some tests and limit the generalisability of the results.

  2. Preliminary analyses showed that the response variable ‘Peso’ does not follow a normal distribution, exhibiting outliers. The normality and heteroscedasticity of the residuals were also not fully resolved, which may affect the validity of confidence intervals and related statistical tests.