Original dataset

Anni.madre  N.gravidanze  Fumatrici  Gestazione  Peso  Lunghezza  Cranio  Tipo.parto  Ospedale  Sesso
        26             0          0          42  3380        490     325  Nat         osp3      M
        21             2          0          39  3150        490     345  Nat         osp1      F
        34             3          0          38  3640        500     375  Nat         osp2      M
        28             1          0          41  3690        515     365  Nat         osp2      M
        20             0          0          38  3700        480     335  Nat         osp3      F
        32             0          0          40  3200        495     340  Nat         osp2      F

Summary of the numeric variables:

Statistic  Anni.madre  N.gravidanze  Fumatrici  Gestazione  Peso  Lunghezza  Cranio
Min.             0.00        0.0000     0.0000       25.00   830      310.0     235
1st Qu.         25.00        0.0000     0.0000       38.00  2990      480.0     330
Median          28.00        1.0000     0.0000       39.00  3300      500.0     340
Mean            28.16        0.9812     0.0416       38.98  3284      494.7     340
3rd Qu.         32.00        1.0000     0.0000       40.00  3620      510.0     350
Max.            46.00       12.0000     1.0000       43.00  4930      565.0     390

Counts of the categorical variables:

Tipo.parto  Ces: 728   Nat: 1772
Ospedale    osp1: 816  osp2: 849  osp3: 835
Sesso       F: 1256    M: 1244
## 
##  Shapiro-Wilk normality test
## 
## data:  Peso
## W = 0.97066, p-value < 2.2e-16

The Shapiro-Wilk test on Peso (W = 0.97066, p-value < 2.2e-16) rejects the null hypothesis that the sample values come from a normal distribution.

Data Visualization

Identified potential non-linearities:

  • Peso-Lunghezza

  • Peso-Cranio

  • Peso-Gestazione

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M 
##        3161.132        3408.215
##               Df    Sum Sq Mean Sq F value Pr(>F)
## Ospedale       2    936237  468118   1.699  0.183
## Residuals   2497 687952305  275512
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Tipo.parto
## t = -0.12968, df = 1493, p-value = 0.8968
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
##  -46.27992  40.54037
## sample estimates:
## mean in group Ces mean in group Nat 
##          3282.047          3284.916

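  • t-test Peso vs Sesso:

The p-value is very small (< 2.2e-16), so the difference in mean weight between females (3161 g) and males (3408 g) is significant. Sesso will be kept as a control variable, as is best practice in medical analysis.

  • ANOVA Peso vs Ospedale:

The p-value is 0.183, so mean weight does not differ significantly across the three hospitals.

  • t-test Peso vs Tipo.parto:

The p-value is very high (0.8968), so the difference in mean weight between cesarean and natural deliveries is not significant. On this bivariate evidence Tipo.parto is a candidate for removal from the model.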
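As a quick arithmetic check on the Peso by Sesso test, the t statistic can be approximately rebuilt from the group means and the confidence interval printed in the output. The sketch below assumes the normal quantile is a good stand-in for the t quantile at roughly 2490 degrees of freedom; the variable names are illustrative.

```python
from statistics import NormalDist

# Values copied from the Welch t-test output for Peso by Sesso.
mean_f, mean_m = 3161.132, 3408.215
ci_low, ci_high = -287.1051, -207.0615

# Assumption: with about 2490 degrees of freedom the t quantile is
# very close to the normal quantile, so the normal one is used here.
z = NormalDist().inv_cdf(0.975)

diff = mean_f - mean_m                    # point estimate of the difference
std_err = (ci_high - ci_low) / (2 * z)    # CI half-width divided by the quantile
t_stat = diff / std_err

print(f"difference = {diff:.3f} g, SE = {std_err:.2f}, t = {t_stat:.2f}")
```

The result lands close to the reported t = -12.106; the small gap comes from using the normal quantile instead of the exact t quantile.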

Hypothesis testing

The significance level (alpha) is set to 0.05 (5%) throughout.

Hypothesis 1:

H0: The proportion of cesarean deliveries is the same across hospitals.

H1: At least one hospital differs in its cesarean rate.

## 
##  Pearson's Chi-squared test
## 
## data:  tab_osp_parto
## X-squared = 1.083, df = 2, p-value = 0.5819

The p-value (0.5819) is greater than 0.05, so there is no statistical evidence against equal proportions of cesarean sections. In other words, in the available data the three hospitals do not differ significantly in the frequency of cesarean sections.
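The reported p-value can be reproduced by hand because, for 2 degrees of freedom, the chi-squared upper tail has the closed form exp(-x/2). A minimal sketch, using the statistic from the output above and assuming a 3 x 2 contingency table (three hospitals, two delivery types):

```python
import math

# Test statistic from the Pearson chi-squared output.
x_squared = 1.083

# Degrees of freedom of a 3 x 2 table: (rows - 1) * (columns - 1).
df = (3 - 1) * (2 - 1)

# For df = 2 the chi-squared survival function is exactly exp(-x / 2).
p_value = math.exp(-x_squared / 2)

print(f"df = {df}, p-value = {p_value:.4f}")
```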

Hypothesis 2:

H0: The sample means of Peso and Lunghezza equal the reference population values (3200 g and 50 cm).

H1: At least one sample mean differs from its reference value.

## 
##  One Sample t-test
## 
## data:  neonati_clean$Peso
## t = 8.0108, df = 2497, p-value = 1.731e-15
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
##  3263.577 3304.791
## sample estimates:
## mean of x 
##  3284.184
## 
##  One Sample t-test
## 
## data:  neonati_clean$Lunghezza
## t = 844.17, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  493.6628 495.7287
## sample estimates:
## mean of x 
##  494.6958

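Both p-values are below 0.05, suggesting that the sample means differ from the hypothesized parameters. Newborns in the dataset therefore do not, on average, exactly match the standard values of 3200 g and 50 cm, with possible clinical implications. Note that Lunghezza is recorded in millimetres, so the reference value of 50 cm corresponds to mu = 500 rather than the mu = 50 used in the test; the conclusion still holds, because the 95% confidence interval (493.66, 495.73) excludes 500.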
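Because the Lunghezza values are in millimetres (sample mean 494.7), a reference of 50 cm corresponds to 500 on that scale. The sketch below approximates what the test statistic would be with mu = 500. It derives the standard error from the printed confidence interval, again assuming the normal quantile in place of the t quantile, so the figures are approximate rather than a re-run of the test.

```python
from statistics import NormalDist

# Values copied from the one-sample t-test output for Lunghezza (millimetres).
mean_len = 494.6958
ci_low, ci_high = 493.6628, 495.7287

z = NormalDist().inv_cdf(0.975)          # normal approximation to the t quantile
std_err = (ci_high - ci_low) / (2 * z)   # standard error implied by the CI

t_as_run = (mean_len - 50) / std_err     # mu = 50, as in the output
t_in_mm = (mean_len - 500) / std_err     # mu = 500 mm, i.e. 50 cm

print(f"SE = {std_err:.4f}")
print(f"t with mu = 50:  {t_as_run:.1f}")
print(f"t with mu = 500: {t_in_mm:.2f}")
```

With mu = 500 the statistic is roughly -10, still far in the rejection region but of a plausible size and with the opposite sign: the sample mean lies slightly below the reference value.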

Hypothesis 3:

H0: The mean of Peso (or Lunghezza) is the same for M and F.

H1: The mean differs between M and F.

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.115, df = 2488.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.4841 -207.3844
## sample estimates:
## mean in group F mean in group M 
##        3161.061        3408.496
## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.939001  -7.882672
## sample estimates:
## mean in group F mean in group M 
##        489.7641        499.6750

Both p-values are below 0.05, so there is a statistically significant difference in weight and length between males and females, in line with the literature documenting slight anthropometric differences by sex at birth.

3. Building a linear regression model

3.1. Multiple regression model: baseline

## 
## Call:
## lm(formula = Peso ~ ., data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1123.26  -181.53   -14.45   161.05  2611.89 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6735.7960   141.4790 -47.610  < 2e-16 ***
## Anni.madre        0.8018     1.1467   0.699   0.4845    
## N.gravidanze     11.3812     4.6686   2.438   0.0148 *  
## Fumatrici       -30.2741    27.5492  -1.099   0.2719    
## Gestazione       32.5773     3.8208   8.526  < 2e-16 ***
## Lunghezza        10.2922     0.3009  34.207  < 2e-16 ***
## Cranio           10.4722     0.4263  24.567  < 2e-16 ***
## Tipo.partoNat    29.6335    12.0905   2.451   0.0143 *  
## Ospedaleosp2    -11.0912    13.4471  -0.825   0.4096    
## Ospedaleosp3     28.2495    13.5054   2.092   0.0366 *  
## SessoM           77.5723    11.1865   6.934 5.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 668.7 on 10 and 2487 DF,  p-value: < 2.2e-16

Adjusted R-squared: 0.7278

As hypothesized, the coefficient for smoking is negative (-30.27 g), but its p-value (0.2719) is above 0.05, so the effect is not statistically significant in this sample.

The variable Anni.madre is also not significant (p = 0.4845). It is retained for now because it may prove useful for data outside the dataset.

Interpretation of the model coefficients (each effect holds the other variables constant):

  • For each additional year of Anni.madre, the weight increases by 0.8018 grams

  • For each additional unit of N.gravidanze, the weight increases by 11.3812 grams

  • For smoking mothers (Fumatrici = 1), the weight decreases by 30.2741 grams compared to non-smokers

  • For each additional week of Gestazione, the weight increases by 32.5773 grams

  • For each additional millimetre of Lunghezza, the weight increases by 10.2922 grams

  • For each additional millimetre of Cranio, the weight increases by 10.4722 grams

  • Compared to the baseline value (Tipo.parto = Ces), Tipo.parto = Nat implies an increase in weight of 29.6335 grams

  • Compared to the baseline value (Ospedale = osp1), Ospedale = osp2 implies a decrease in weight of 11.0912 grams

  • Compared to the baseline value (Ospedale = osp1), Ospedale = osp3 implies an increase in weight of 28.2495 grams

  • Compared to the baseline value (Sesso = F), Sesso = M implies an increase in weight of 77.5723 grams

3.2. Multiple regression model: backward variable selection

## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Sesso, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1140.10  -181.96   -14.86   160.30  2629.68 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6738.3356   141.5660 -47.599  < 2e-16 ***
## Anni.madre        0.8681     1.1479   0.756   0.4496    
## N.gravidanze     11.6900     4.6733   2.501   0.0124 *  
## Fumatrici       -31.7061    27.5836  -1.149   0.2505    
## Gestazione       32.8963     3.8248   8.601  < 2e-16 ***
## Lunghezza        10.2691     0.3012  34.098  < 2e-16 ***
## Cranio           10.4850     0.4268  24.564  < 2e-16 ***
## Tipo.partoNat    30.3855    12.1052   2.510   0.0121 *  
## SessoM           78.0234    11.2013   6.966 4.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2489 degrees of freedom
## Multiple R-squared:  0.7279, Adjusted R-squared:  0.727 
## F-statistic: 832.3 on 8 and 2489 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Sesso
## Model 2: Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1   2489 187430749                              
## 2   2487 186743194  2    687555 4.5783 0.01036 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
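
The F statistic in the analysis-of-variance table above compares the model without Ospedale to the model with it. As a sketch, it can be recomputed from the two residual sums of squares and residual degrees of freedom shown in the table (variable names are illustrative):

```python
# Residual sums of squares and residual degrees of freedom from the table.
rss_reduced, df_reduced = 187430749, 2489   # Model 1: without Ospedale
rss_full, df_full = 186743194, 2487         # Model 2: with Ospedale

extra_ss = rss_reduced - rss_full           # "Sum of Sq" column
extra_df = df_reduced - df_full             # "Df" column

# Partial F statistic: extra explained variation per added parameter,
# relative to the residual variance of the larger model.
f_stat = (extra_ss / extra_df) / (rss_full / df_full)

print(f"Sum of Sq = {extra_ss}, Df = {extra_df}, F = {f_stat:.4f}")
```

With p = 0.01036, the test indicates that Ospedale as a whole improves the fit at the 5% level, even though only one of its two dummy coefficients is individually significant.
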
## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Sesso, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1139.50  -181.60   -14.59   160.14  2633.16 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6737.9269   141.5747 -47.593  < 2e-16 ***
## Anni.madre        0.8793     1.1479   0.766   0.4438    
## N.gravidanze     11.4176     4.6676   2.446   0.0145 *  
## Gestazione       32.6300     3.8180   8.546  < 2e-16 ***
## Lunghezza        10.2839     0.3009  34.176  < 2e-16 ***
## Cranio           10.4896     0.4268  24.574  < 2e-16 ***
## Tipo.partoNat    30.1222    12.1038   2.489   0.0129 *  
## SessoM           77.8374    11.2008   6.949 4.67e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 2490 degrees of freedom
## Multiple R-squared:  0.7278, Adjusted R-squared:  0.727 
## F-statistic: 950.9 on 7 and 2490 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Sesso, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1143.29  -184.07   -15.52   161.15  2617.47 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6748.7308   141.6472 -47.645  < 2e-16 ***
## Anni.madre        1.9144     1.0681   1.792   0.0732 .  
## Gestazione       32.1537     3.8168   8.424  < 2e-16 ***
## Lunghezza        10.2496     0.3009  34.065  < 2e-16 ***
## Cranio           10.5733     0.4259  24.826  < 2e-16 ***
## Tipo.partoNat    29.3644    12.1120   2.424   0.0154 *  
## SessoM           78.6331    11.2073   7.016 2.93e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2491 degrees of freedom
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7264 
## F-statistic:  1106 on 6 and 2491 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione + Lunghezza + Cranio + 
##     Sesso, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1163.03  -184.20   -14.07   163.24  2618.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6723.2290   141.3943 -47.550  < 2e-16 ***
## Anni.madre      1.8995     1.0692   1.777   0.0758 .  
## Gestazione     32.2256     3.8205   8.435  < 2e-16 ***
## Lunghezza      10.2137     0.3008  33.954  < 2e-16 ***
## Cranio         10.6047     0.4261  24.887  < 2e-16 ***
## SessoM         78.6738    11.2182   7.013 2.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared:  0.7265, Adjusted R-squared:  0.7259 
## F-statistic:  1324 on 5 and 2492 DF,  p-value: < 2.2e-16
## Start:  AIC=28054.55
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1     36710 186779904 28053
## - Fumatrici     1     90677 186833870 28054
## <none>                      186743194 28055
## - N.gravidanze  1    446244 187189438 28058
## - Tipo.parto    1    451073 187194266 28059
## - Ospedale      2    687555 187430749 28060
## - Sesso         1   3610705 190353899 28100
## - Gestazione    1   5458852 192202046 28124
## - Cranio        1  45318506 232061700 28595
## - Lunghezza     1  87861708 274604902 29016
## 
## Step:  AIC=28053.05
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Fumatrici     1     91599 186871503 28052
## <none>                      186779904 28053
## + Anni.madre    1     36710 186743194 28055
## - Tipo.parto    1    452049 187231953 28057
## - Ospedale      2    693914 187473818 28058
## - N.gravidanze  1    631082 187410986 28060
## - Sesso         1   3617809 190397713 28099
## - Gestazione    1   5424800 192204704 28123
## - Cranio        1  45569477 232349381 28596
## - Lunghezza     1  87852027 274631931 29014
## 
## Step:  AIC=28052.27
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      186871503 28052
## + Fumatrici     1     91599 186779904 28053
## + Anni.madre    1     37633 186833870 28054
## - Tipo.parto    1    444404 187315907 28056
## - Ospedale      2    702925 187574428 28058
## - N.gravidanze  1    608136 187479640 28058
## - Sesso         1   3601860 190473363 28098
## - Gestazione    1   5358199 192229702 28121
## - Cranio        1  45613331 232484834 28596
## - Lunghezza     1  88259386 275130889 29017
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.07  -181.71   -16.66   161.08  2619.57 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6707.9252   136.0257 -49.314  < 2e-16 ***
## N.gravidanze     12.3360     4.3344   2.846  0.00446 ** 
## Gestazione       32.0386     3.7925   8.448  < 2e-16 ***
## Lunghezza        10.3059     0.3006  34.286  < 2e-16 ***
## Cranio           10.4920     0.4257  24.648  < 2e-16 ***
## Tipo.partoNat    29.4080    12.0875   2.433  0.01505 *  
## Ospedaleosp2    -10.8939    13.4447  -0.810  0.41786    
## Ospedaleosp3     28.7917    13.4969   2.133  0.03301 *  
## SessoM           77.4657    11.1842   6.926 5.48e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2489 degrees of freedom
## Multiple R-squared:  0.7287, Adjusted R-squared:  0.7278 
## F-statistic: 835.7 on 8 and 2489 DF,  p-value: < 2.2e-16
## Start:  AIC=28118.62
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1     36710 186779904 28111
## - Fumatrici     1     90677 186833870 28112
## - Ospedale      2    687555 187430749 28112
## - N.gravidanze  1    446244 187189438 28117
## - Tipo.parto    1    451073 187194266 28117
## <none>                      186743194 28119
## - Sesso         1   3610705 190353899 28159
## - Gestazione    1   5458852 192202046 28183
## - Cranio        1  45318506 232061700 28654
## - Lunghezza     1  87861708 274604902 29074
## 
## Step:  AIC=28111.29
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Fumatrici     1     91599 186871503 28105
## - Ospedale      2    693914 187473818 28105
## - Tipo.parto    1    452049 187231953 28110
## <none>                      186779904 28111
## - N.gravidanze  1    631082 187410986 28112
## + Anni.madre    1     36710 186743194 28119
## - Sesso         1   3617809 190397713 28151
## - Gestazione    1   5424800 192204704 28175
## - Cranio        1  45569477 232349381 28649
## - Lunghezza     1  87852027 274631931 29066
## 
## Step:  AIC=28104.69
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Ospedale      2    702925 187574428 28098
## - Tipo.parto    1    444404 187315907 28103
## <none>                      186871503 28105
## - N.gravidanze  1    608136 187479640 28105
## + Fumatrici     1     91599 186779904 28111
## + Anni.madre    1     37633 186833870 28112
## - Sesso         1   3601860 190473363 28145
## - Gestazione    1   5358199 192229702 28168
## - Cranio        1  45613331 232484834 28642
## - Lunghezza     1  88259386 275130889 29063
## 
## Step:  AIC=28098.42
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Tipo.parto    1    467626 188042054 28097
## <none>                      187574428 28098
## - N.gravidanze  1    648873 188223301 28099
## + Ospedale      2    702925 186871503 28105
## + Fumatrici     1    100610 187473818 28105
## + Anni.madre    1     44184 187530244 28106
## - Sesso         1   3644818 191219246 28139
## - Gestazione    1   5457887 193032315 28162
## - Cranio        1  45747094 233321522 28636
## - Lunghezza     1  87955701 275530129 29051
## 
## Step:  AIC=28096.81
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      188042054 28097
## - N.gravidanze  1    621053 188663107 28097
## + Tipo.parto    1    467626 187574428 28098
## + Ospedale      2    726146 187315907 28103
## + Fumatrici     1     92548 187949505 28103
## + Anni.madre    1     45366 187996688 28104
## - Sesso         1   3650790 191692844 28137
## - Gestazione    1   5477493 193519547 28161
## - Cranio        1  46098547 234140601 28637
## - Lunghezza     1  87532691 275574744 29044
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1149.37  -180.98   -15.57   163.69  2639.09 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6681.7251   135.8036 -49.201  < 2e-16 ***
## N.gravidanze    12.4554     4.3416   2.869  0.00415 ** 
## Gestazione      32.3827     3.8008   8.520  < 2e-16 ***
## Lunghezza       10.2455     0.3008  34.059  < 2e-16 ***
## Cranio          10.5410     0.4265  24.717  < 2e-16 ***
## SessoM          77.9807    11.2111   6.956 4.47e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7265 
## F-statistic:  1327 on 5 and 2492 DF,  p-value: < 2.2e-16

Adjusted R-squared by model:

  • Model 1 = 0.7278

  • Model 2 = 0.727

  • Model 3 = 0.727

  • Model 4 = 0.7264

  • Model 5 = 0.7259

  • Model AIC = 0.7278

  • Model BIC = 0.7265

##         df      BIC
## mod1    12 35215.45
## mod2    10 35208.98
## mod3     9 35202.49
## mod4     8 35200.66
## mod5     7 35198.72
## mod_aic 10 35201.52
## mod_bic  7 35193.65
## N.gravidanze   Gestazione    Lunghezza       Cranio        Sesso 
##     1.023462     1.669779     2.075747     1.624568     1.040184

BIC:

Model obtained using BIC has the lowest BIC value: 35193.65.
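
The selection trace labels its criterion "AIC" even when the BIC penalty is used, and its values (about 28097) differ from the BIC table (about 35194). The sketch below shows how both numbers follow from the same residual sum of squares, assuming the usual Gaussian log-likelihood for a least-squares fit. The function name and arguments are illustrative, and n = 2498 is inferred from the residual degrees of freedom plus the number of coefficients.

```python
import math

N_OBS = 2498  # 2492 residual df + 6 coefficients in the final model

def bic_gaussian(rss, n_coef, n_obs=N_OBS):
    """BIC of a least-squares fit with Gaussian errors.

    The parameter count is n_coef + 1 because the error variance is
    estimated as well; this matches the 'df' column of the BIC table.
    """
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(rss / n_obs) + 1)
    return -2 * log_lik + (n_coef + 1) * math.log(n_obs)

rss_mod_bic = 188042054   # RSS in the <none> row of the last selection step

full_bic = bic_gaussian(rss_mod_bic, n_coef=6)

# The selection trace drops the additive constants and counts only the
# coefficients: n * log(RSS / n) + log(n) * n_coef.
trace_value = N_OBS * math.log(rss_mod_bic / N_OBS) + math.log(N_OBS) * 6

print(f"BIC table value:   {full_bic:.2f}")
print(f"selection 'AIC':   {trace_value:.2f}")
```

The two values differ only by a constant that is the same for every model fitted to these data, so both rank the models identically.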

VIF:

All values lie between 1.02 and 2.08, well below the threshold (5), so there is no danger of multicollinearity.

3.3. Multiple regression model: non-linear effects

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + I(Lunghezza^2), data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1169.62  -181.77   -12.79   163.77  1786.03 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    212.288548 723.852095   0.293 0.769336    
## N.gravidanze    14.085464   4.266175   3.302 0.000975 ***
## Gestazione      42.551398   3.876629  10.976  < 2e-16 ***
## Lunghezza      -20.267001   3.162718  -6.408 1.76e-10 ***
## Cranio          10.651783   0.418894  25.428  < 2e-16 ***
## SessoM          69.968733  11.038797   6.338 2.75e-10 ***
## I(Lunghezza^2)   0.031655   0.003267   9.690  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.7 on 2491 degrees of freedom
## Multiple R-squared:  0.7369, Adjusted R-squared:  0.7363 
## F-statistic:  1163 on 6 and 2491 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + I(Gestazione^2), data = neonati_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1144.0  -181.5   -12.9   165.8  2661.9 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4646.7158   898.6322  -5.171 2.52e-07 ***
## N.gravidanze       12.5489     4.3381   2.893  0.00385 ** 
## Gestazione        -81.2309    49.7402  -1.633  0.10257    
## Lunghezza          10.3502     0.3040  34.045  < 2e-16 ***
## Cranio             10.6376     0.4282  24.843  < 2e-16 ***
## SessoM             75.7563    11.2435   6.738 1.99e-11 ***
## I(Gestazione^2)     1.5168     0.6621   2.291  0.02206 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.5 on 2491 degrees of freedom
## Multiple R-squared:  0.7276, Adjusted R-squared:  0.7269 
## F-statistic:  1109 on 6 and 2491 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + I(Cranio^2), data = neonati_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.6  -179.4   -14.8   163.4  2622.6 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    84.10118 1151.77280   0.073  0.94180    
## N.gravidanze   12.76356    4.31259   2.960  0.00311 ** 
## Gestazione     38.90540    3.93291   9.892  < 2e-16 ***
## Lunghezza      10.48745    0.30157  34.776  < 2e-16 ***
## Cranio        -31.79371    7.16973  -4.434 9.63e-06 ***
## SessoM         73.10236   11.16590   6.547 7.11e-11 ***
## I(Cranio^2)     0.06262    0.01059   5.915 3.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 272.8 on 2491 degrees of freedom
## Multiple R-squared:  0.7308, Adjusted R-squared:  0.7301 
## F-statistic:  1127 on 6 and 2491 DF,  p-value: < 2.2e-16

Adjusted R-squared by model:

mod_bic = 0.7265

mod_bic_1 = 0.7363

mod_bic_2 = 0.7269

mod_bic_3 = 0.7301

##           df      BIC
## mod_bic    7 35193.65
## mod_bic_1  8 35109.04
## mod_bic_2  8 35196.21
## mod_bic_3  8 35166.63
##   N.gravidanze     Gestazione      Lunghezza         Cranio          Sesso 
##       1.025055       1.801815     238.007682       1.625780       1.046053 
## I(Lunghezza^2) 
##     230.033474

BIC:

Model_bic_1 has the lowest BIC value: 35109.04.
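
In mod_bic_1 the Lunghezza coefficient is negative, which can look alarming on its own. With a squared term in the model, the effect of length is the derivative of the fitted curve and depends on the length itself. A minimal sketch using the coefficients printed above (function and variable names are illustrative):

```python
# Coefficients of mod_bic_1 copied from the summary output.
b_len = -20.267001     # Lunghezza
b_len_sq = 0.031655    # I(Lunghezza^2)

def marginal_effect(length_mm):
    """Slope of fitted weight with respect to Lunghezza (grams per mm)."""
    return b_len + 2 * b_len_sq * length_mm

# Length at which the slope is zero (vertex of the parabola).
turning_point = -b_len / (2 * b_len_sq)

# Quartiles of Lunghezza taken from the dataset summary.
for length_mm in (480.0, 500.0, 510.0):
    print(f"{length_mm:.0f} mm: {marginal_effect(length_mm):.2f} g/mm")
print(f"slope is zero at {turning_point:.1f} mm")
```

The slope changes sign near 320 mm, just above the observed minimum of 310 mm, so across almost the whole observed range longer newborns are predicted to be heavier, and increasingly so.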

VIF:

There is strong multicollinearity between I(Lunghezza^2) and Lunghezza: their VIF values (230.03 and 238.01) are far above the threshold of 5.
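
A VIF can be read as a statement about how well one predictor is explained by the others, through the standard identity VIF_j = 1 / (1 - R_j^2). The sketch below inverts that identity for a few of the values reported above; the helper name is illustrative.

```python
def r_squared_from_vif(vif):
    """Invert VIF_j = 1 / (1 - R_j^2) to recover R_j^2."""
    return 1 - 1 / vif

# VIF values copied from the outputs for mod_bic and mod_bic_1.
vifs = {
    "Lunghezza (mod_bic)": 2.075747,
    "Lunghezza (mod_bic_1)": 238.007682,
    "I(Lunghezza^2) (mod_bic_1)": 230.033474,
}

for name, vif in vifs.items():
    print(f"{name}: R_j^2 = {r_squared_from_vif(vif):.4f}")
```

In mod_bic about half of the variance of Lunghezza is explained by the other predictors; once the squared term is added, more than 99% is, which is why the two coefficients become hard to estimate separately.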

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Cranio + Sesso + 
##     I(Lunghezza^2), data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1161.44  -180.02   -11.17   165.90  2381.94 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.335e+03  1.442e+02 -30.059  < 2e-16 ***
## N.gravidanze    1.314e+01  4.298e+00   3.057  0.00226 ** 
## Gestazione      3.481e+01  3.713e+00   9.374  < 2e-16 ***
## Cranio          1.047e+01  4.212e-01  24.845  < 2e-16 ***
## SessoM          7.455e+01  1.110e+01   6.714 2.33e-11 ***
## I(Lunghezza^2)  1.081e-02  3.075e-04  35.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 271.9 on 2492 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.7321 
## F-statistic:  1365 on 5 and 2492 DF,  p-value: < 2.2e-16
##             df      BIC
## mod_bic      7 35193.65
## mod_bic_1    8 35109.04
## mod_bic_1_1  7 35142.06
##   N.gravidanze     Gestazione         Cranio          Sesso I(Lunghezza^2) 
##       1.023828       1.626672       1.617959       1.041656       2.006202

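BIC: The value rises from 35109.04 (mod_bic_1) to 35142.06 (mod_bic_1_1) but stays below that of mod_bic (35193.65).

VIF: Multicollinearity is no longer present; the largest value is 2.01, for I(Lunghezza^2).

We will keep mod_bic_1_1.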

3.4. Multiple regression model: variable interactions

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + Gestazione:Lunghezza, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1133.41  -179.98   -11.52   168.93  2652.65 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.991e+03  9.206e+02  -2.163 0.030631 *  
## N.gravidanze          1.303e+01  4.321e+00   3.015 0.002594 ** 
## Gestazione           -9.391e+01  2.481e+01  -3.785 0.000157 ***
## Lunghezza            -8.476e-02  2.028e+00  -0.042 0.966661    
## Cranio                1.076e+01  4.264e-01  25.234  < 2e-16 ***
## SessoM                7.225e+01  1.121e+01   6.445 1.38e-10 ***
## Gestazione:Lunghezza  2.729e-01  5.298e-02   5.151 2.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.3 on 2491 degrees of freedom
## Multiple R-squared:  0.7299, Adjusted R-squared:  0.7292 
## F-statistic:  1122 on 6 and 2491 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + Gestazione:Cranio, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1137.04  -181.47   -12.19   167.45  2695.30 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -187.95215 1106.93645  -0.170  0.86519    
## N.gravidanze        13.12748    4.31382   3.043  0.00237 ** 
## Gestazione        -140.78001   29.53978  -4.766 1.99e-06 ***
## Lunghezza           10.46687    0.30113  34.759  < 2e-16 ***
## Cranio              -9.85430    3.47659  -2.834  0.00463 ** 
## SessoM              72.00219   11.18136   6.439 1.43e-10 ***
## Gestazione:Cranio    0.53389    0.09033   5.910 3.88e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 272.8 on 2491 degrees of freedom
## Multiple R-squared:  0.7308, Adjusted R-squared:  0.7301 
## F-statistic:  1127 on 6 and 2491 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso + Lunghezza:Cranio, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1150.65  -180.93   -13.48   165.99  2865.46 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.803e+03  1.018e+03  -1.771   0.0767 .  
## N.gravidanze      1.293e+01  4.323e+00   2.991   0.0028 ** 
## Gestazione        3.815e+01  3.967e+00   9.616  < 2e-16 ***
## Lunghezza        -3.060e-01  2.203e+00  -0.139   0.8895    
## Cranio           -4.755e+00  3.192e+00  -1.490   0.1365    
## SessoM            7.324e+01  1.120e+01   6.537 7.59e-11 ***
## Lunghezza:Cranio  3.157e-02  6.531e-03   4.835 1.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 2491 degrees of freedom
## Multiple R-squared:  0.7296, Adjusted R-squared:  0.7289 
## F-statistic:  1120 on 6 and 2491 DF,  p-value: < 2.2e-16

Adjusted R-squared by model:

mod_bic = 0.7265 | baseline model

mod_bic_4 = 0.7292 | the Gestazione:Lunghezza interaction adds little explanatory power

mod_bic_5 = 0.7301 | the Gestazione:Cranio interaction (coefficient 0.53389) gives the largest gain among the three interactions

mod_bic_6 = 0.7289 | the Lunghezza:Cranio interaction adds little explanatory power
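
In mod_bic_5 the main-effect coefficients of Gestazione and Cranio are negative, but with an interaction term they no longer describe the effect of each variable on its own. The sketch below evaluates the effects at the sample medians, using the coefficients printed above and assuming Gestazione is measured in weeks and Cranio in millimetres, as the summary statistics suggest. Function names are illustrative.

```python
# Coefficients of mod_bic_5 copied from the summary output.
b_gest = -140.78001    # Gestazione
b_cranio = -9.85430    # Cranio
b_inter = 0.53389      # Gestazione:Cranio

def gestazione_effect(cranio_mm):
    """Change in fitted weight (g) per extra week at a given Cranio."""
    return b_gest + b_inter * cranio_mm

def cranio_effect(gest_weeks):
    """Change in fitted weight (g) per extra mm of Cranio at a given Gestazione."""
    return b_cranio + b_inter * gest_weeks

print(f"per week at Cranio = 340 mm:       {gestazione_effect(340):.2f} g")
print(f"per mm at Gestazione = 39 weeks:   {cranio_effect(39):.2f} g")
```

At typical values both effects are positive and of the same order as in the models without the interaction (roughly 41 g per week and 11 g per millimetre).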

##           df      BIC
## mod_bic    7 35193.65
## mod_bic_4  8 35175.01
## mod_bic_5  8 35166.68
## mod_bic_6  8 35178.14
##      N.gravidanze        Gestazione         Lunghezza            Cranio 
##          1.024173        102.234068          2.108378        109.430962 
##             Sesso Gestazione:Cranio 
##          1.048767        301.410457

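BIC:

mod_bic_5 has the lowest BIC value among the interaction models: 35166.68.

VIF:

Gestazione (102.23), Cranio (109.43) and Gestazione:Cranio (301.41) are far above the threshold (5), so the interaction term introduces strong multicollinearity with its component variables. The remaining values are at most 2.11.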

## [1] "RMSE of model mod_bic: 274.37"
## [1] "RMSE of model mod_bic_1: 269.34"
## [1] "RMSE of model mod_bic_2: 274.08"
## [1] "RMSE of model mod_bic_3: 272.46"
## [1] "RMSE of model mod_bic_4: 272.92"
## [1] "RMSE of model mod_bic_5: 272.46"

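mod_bic_1 has the lowest RMSE (269.34), but it was set aside because of the collinearity between Lunghezza and I(Lunghezza^2). Among the other models, mod_bic_5 ties with mod_bic_3 for the lowest RMSE (272.46), which is consistent with the selection of mod_bic_5.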
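
The RMSE figures are slightly smaller than the residual standard errors in the model summaries because the two quantities divide the same residual sum of squares by different denominators. A small sketch for mod_bic, assuming n = 2498 observations (residual degrees of freedom plus coefficients) and taking the RSS from the last selection step:

```python
import math

n_obs = 2498        # observations in neonati_clean
rss = 188042054     # RSS of mod_bic (<none> row of the last selection step)
resid_df = 2492     # residual degrees of freedom of mod_bic

rmse = math.sqrt(rss / n_obs)              # divides by n
resid_std_err = math.sqrt(rss / resid_df)  # divides by n minus the number of coefficients

print(f"RMSE = {rmse:.2f}, residual standard error = {resid_std_err:.1f}")
```

Because RMSE computed on the fitting data carries no penalty for extra parameters, it tends to favor larger models, which is one reason to read it alongside BIC.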

3.5. Residuals Analysis:

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod_bic_5)
## W = 0.97279, p-value < 2.2e-16
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_bic_5
## BP = 84.705, df = 6, p-value = 3.801e-16
## 
##  Durbin-Watson test
## 
## data:  mod_bic_5
## DW = 1.9576, p-value = 0.1445
## alternative hypothesis: true autocorrelation is greater than 0
##          13          15          34          36          67          89 
## 0.005747330 0.007609026 0.006784240 0.007352213 0.005989360 0.012817577 
##          96         101         106         131         134         151 
## 0.006146495 0.008444940 0.028194760 0.008650920 0.007876461 0.014278965 
##         155         161         204         206         220         249 
## 0.007925092 0.021393082 0.014567541 0.011697548 0.007490072 0.005839164 
##         277         294         305         310         312         315 
## 0.005862438 0.005915532 0.005640770 0.069170355 0.018098857 0.007419291 
##         378         442         445         492         516         565 
## 0.038362637 0.007816488 0.010389038 0.009076806 0.013228625 0.005764124 
##         582         587         592         615         638         656 
## 0.011674287 0.010321764 0.006476594 0.005958671 0.007120330 0.005982558 
##         684         697         706         726         748         750 
## 0.008886536 0.005999991 0.006031410 0.005800762 0.012150748 0.007694247 
##         757         765         805         828         895         928 
## 0.008217726 0.006677521 0.032449475 0.007259111 0.007366315 0.064739305 
##         946         947         956         985        1014        1067 
## 0.007594779 0.009685791 0.008659375 0.007083491 0.008573911 0.010209592 
##        1091        1106        1130        1134        1181        1188 
## 0.012879065 0.006797763 0.033998363 0.006363565 0.007731113 0.006517382 
##        1200        1219        1238        1248        1273        1291 
## 0.005629131 0.030699856 0.005918331 0.031717329 0.007529563 0.006179871 
##        1293        1311        1321        1356        1357        1385 
## 0.006293069 0.009801561 0.010149158 0.006630841 0.007732511 0.018979405 
##        1400        1411        1428        1429        1450        1505 
## 0.005936771 0.008050716 0.008198110 0.034614750 0.015106722 0.013335823 
##        1551        1553        1556        1573        1593        1610 
## 0.050017642 0.010102424 0.007092655 0.005694752 0.005627397 0.010710053 
##        1619        1686        1693        1701        1712        1718 
## 0.021786515 0.009720875 0.005678489 0.011295072 0.007098503 0.007068431 
##        1727        1735        1780        1781        1809        1827 
## 0.013551851 0.005600208 0.105097961 0.016980770 0.015055981 0.006083594 
##        1868        1977        2016        2040        2046        2086 
## 0.006246938 0.008805846 0.008844684 0.011937094 0.005766639 0.013322958 
##        2089        2114        2115        2120        2140        2146 
## 0.006474326 0.023946969 0.012422758 0.054102027 0.007930323 0.005863961 
##        2148        2149        2157        2175        2200        2215 
## 0.008293948 0.024776338 0.006255313 0.110152510 0.015250724 0.005603405 
##        2216        2220        2221        2224        2225        2244 
## 0.008617219 0.005942666 0.021633975 0.005842346 0.006220303 0.006930495 
##        2257        2307        2317        2337        2359        2391 
## 0.006355257 0.026451369 0.007789477 0.006100252 0.010102635 0.006109896 
##        2408        2422        2437        2452        2458        2471 
## 0.013231022 0.021615570 0.058362458 0.109498269 0.010261969 0.021040384 
##        2478 
## 0.005857822
##       rstudent unadjusted p-value Bonferroni p
## 1551 10.348799         1.3258e-24   3.3118e-21
## 155   5.222863         1.9076e-07   4.7653e-04
## 1306  4.736467         2.2969e-06   5.7377e-03
## named numeric(0)

Residuals graph comments

  • Residuals vs fitted:

Residuals are randomly arranged around the mean (0) with no obvious patterns.


  • Q-Q Residuals:

The residuals lie broadly along the diagonal, so their distribution is approximately normal, although the Shapiro-Wilk test rejects exact normality.


  • Scale-Location:

The variance appears roughly constant in the plot, although the Breusch-Pagan test indicates heteroskedasticity.


  • Residuals vs Leverage:

There are no points beyond the Cook's distance contours.


  • Shapiro-Wilk test:

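The p-value is below 2.2e-16, so we reject the hypothesis that the residuals are normally distributed. With about 2,500 observations the test is sensitive to even small departures from normality.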


  • Breusch-Pagan test against heteroskedasticity:

The p-value is 3.801e-16, so we reject the null hypothesis of homoskedasticity: the residual variance is not constant.


  • Durbin-Watson test for autocorrelation:

The p-value is 0.1445, so we fail to reject the null hypothesis of no autocorrelation in the residuals.
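
The Durbin-Watson statistic can be translated into an approximate lag-1 autocorrelation through the large-sample relation DW ≈ 2(1 - rho). A minimal sketch using the statistic from the output:

```python
dw = 1.9576   # Durbin-Watson statistic from the test output

# Large-sample approximation: DW is roughly 2 * (1 - rho), where rho is
# the lag-1 autocorrelation of the residuals.
rho_hat = 1 - dw / 2

print(f"approximate lag-1 autocorrelation: {rho_hat:.4f}")
```

A value this close to zero agrees with the test result: consecutive residuals are essentially uncorrelated.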


  • Leverage

133 observations exceed the leverage threshold; they are listed above with their hat values.


  • Outlier

Three observations are flagged as outliers by the Bonferroni-adjusted test: 1551, 155 and 1306.
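
The "Bonferroni p" column of the outlier test is the unadjusted p-value multiplied by the number of observations tested, capped at 1. The sketch below reproduces that column from the printed values, assuming n = 2498 observations in the fitted model; small differences in the last digit come from rounding of the printed p-values.

```python
n_obs = 2498   # observations used to fit mod_bic_5 (assumed from the model output)

# (observation, studentized residual, unadjusted p-value) from the output.
flagged = [
    (1551, 10.348799, 1.3258e-24),
    (155, 5.222863, 1.9076e-07),
    (1306, 4.736467, 2.2969e-06),
]

# Bonferroni correction: multiply each p-value by the number of tests.
bonferroni = {obs: min(1.0, p_raw * n_obs) for obs, _, p_raw in flagged}

for obs, rstudent, _ in flagged:
    print(f"obs {obs}: rstudent = {rstudent:.2f}, Bonferroni p = {bonferroni[obs]:.4e}")
```

All three adjusted p-values remain below 0.05, so each observation is an outlier even after accounting for the 2498 simultaneous comparisons.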


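  • Cook's distance:

No observation exceeds the Cook's distance threshold (the check returns an empty vector). Observation 1551 is still potentially influential and could distort the model, because it combines the largest studentized residual (10.35) with a leverage value (0.050) above the threshold.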

## [1] "Estimated newborn weight: 3268.1 grams"
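
A prediction of this kind is the model equation evaluated at one set of inputs. The sketch below writes out the mod_bic_5 equation with the coefficients from section 3.4. The inputs are hypothetical and are NOT the ones behind the figure printed above, which the output does not show; it is also an assumption that the figure comes from mod_bic_5. The function and dictionary names are illustrative.

```python
# Coefficients of mod_bic_5 copied from the summary in section 3.4.
COEF = {
    "intercept": -187.95215,
    "N.gravidanze": 13.12748,
    "Gestazione": -140.78001,
    "Lunghezza": 10.46687,
    "Cranio": -9.85430,
    "SessoM": 72.00219,
    "Gestazione:Cranio": 0.53389,
}

def predict_peso(n_gravidanze, gestazione, lunghezza, cranio, sesso):
    """Fitted weight in grams for one newborn under mod_bic_5."""
    return (
        COEF["intercept"]
        + COEF["N.gravidanze"] * n_gravidanze
        + COEF["Gestazione"] * gestazione
        + COEF["Lunghezza"] * lunghezza
        + COEF["Cranio"] * cranio
        + COEF["SessoM"] * (1 if sesso == "M" else 0)
        + COEF["Gestazione:Cranio"] * gestazione * cranio
    )

# Hypothetical newborn: third pregnancy, 39 weeks, median length and head size.
example_f = predict_peso(2, 39, 500, 340, "F")
example_m = predict_peso(2, 39, 500, 340, "M")

print(f"female: {example_f:.1f} g, male: {example_m:.1f} g")
```

Changing only Sesso shifts the prediction by the SessoM coefficient, which is a quick sanity check that the equation is wired up as intended.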

Graph 1: “N.gravidanze” vs “Peso” by “Sesso”

The graph shows the relationship between “Peso” and “N.gravidanze”, colored according to the variable “Sesso”. There is no obvious difference in the behavior of the two groups as “N.gravidanze” varies. Generally, males weigh more than females.

Graph 2: “Fumatrici” vs “Peso” by “Sesso”

The graph shows the relationship between “Peso” and “Fumatrici”, colored according to the variable “Sesso”. The behavior of the two groups is similar, because the fitted lines have a similar slope: there is no obvious difference between the sexes in how weight changes with “Fumatrici”. Generally, males weigh more than females.

Graph 3: “Cranio”/“Lunghezza”/“Gestazione” vs “Peso” by “Sesso”

All three graphs show a positive, approximately linear relationship between the predictor and “Peso”. In each graph the same behavior holds for both sexes, with males generally heavier than females.