Original dataset

##   Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto
## 1         26            0         0         42 3380       490    325        Nat
## 2         21            2         0         39 3150       490    345        Nat
## 3         34            3         0         38 3640       500    375        Nat
## 4         28            1         0         41 3690       515    365        Nat
## 5         20            0         0         38 3700       480    335        Nat
## 6         32            0         0         40 3200       495    340        Nat
##   Ospedale Sesso
## 1     osp3     M
## 2     osp1     F
## 3     osp2     M
## 4     osp2     M
## 5     osp3     F
## 6     osp2     F

EDA: Exploratory Data Analysis

For each feature, a set of descriptive statistics is computed as part of the exploratory data analysis (numeric summaries are NA for character features). These statistics help us understand the data and the phenomenon under study.
VarName       VarType    N_Null  Min  Max   Mean       Median  N_Distinct
Anni.madre    integer    0       0    46    28.1640    28      36
N.gravidanze  integer    0       0    12    0.9812     1       13
Fumatrici     integer    0       0    1     0.0416     0       2
Gestazione    integer    0       25   43    38.9804    39      19
Peso          integer    0       830  4930  3284.0808  3300    287
Lunghezza     integer    0       310  565   494.6920   500     56
Cranio        integer    0       235  390   340.0292   340     112
Tipo.parto    character  0       NA   NA    NA         NA      2
Ospedale      character  0       NA   NA    NA         NA      3
Sesso         character  0       NA   NA    NA         NA      2

CONCLUSIONS:

The dataset has no missing values. Min, Max, Mean and Median will be useful when analyzing the distribution plots. The distinct count helps to identify the type of each feature (binary, continuous, discrete) and to spot unexpected values (for example, only two distinct values are expected in a field like [Sesso]).
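A summary table of this kind could be produced along the following lines. This is a minimal sketch in base R, not the report's actual code: the `toy` data frame is synthetic and `describe_column` is a hypothetical helper introduced here for illustration.

```r
# Synthetic stand-in for the real data (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  Anni.madre = sample(18:40, 50, replace = TRUE),
  Fumatrici  = rbinom(50, 1, 0.1),
  Peso       = round(rnorm(50, 3300, 400)),
  Sesso      = sample(c("M", "F"), 50, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one summary row per column.
# Numeric statistics are NA for non-numeric columns.
describe_column <- function(x) {
  is_num <- is.numeric(x)
  data.frame(
    VarType    = class(x)[1],
    N_Null     = sum(is.na(x)),
    Min        = if (is_num) as.numeric(min(x, na.rm = TRUE)) else NA_real_,
    Max        = if (is_num) as.numeric(max(x, na.rm = TRUE)) else NA_real_,
    Mean       = if (is_num) mean(x, na.rm = TRUE) else NA_real_,
    Median     = if (is_num) as.numeric(median(x, na.rm = TRUE)) else NA_real_,
    N_Distinct = length(unique(x))
  )
}

# Apply the helper to every column and stack the rows
eda_summary <- data.frame(
  VarName = names(toy),
  do.call(rbind, lapply(toy, describe_column)),
  row.names = NULL
)
print(eda_summary)
```

A distinct count of 2 for a column such as `Sesso` is the kind of sanity check described above.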

Data distribution check

CONCLUSIONS:

The plots help us understand the data distribution, highlight the presence of incorrect values and evaluate the constraints of our analysis. They also provide information about the balance of the classes.

Data cleaning and filtering

## [1] 2449   10
##    Anni.madre     N.gravidanze      Fumatrici         Gestazione   
##  Min.   :13.00   Min.   :0.0000   Min.   :0.00000   Min.   :28.00  
##  1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:38.00  
##  Median :28.00   Median :1.0000   Median :0.00000   Median :39.00  
##  Mean   :28.11   Mean   :0.9053   Mean   :0.04206   Mean   :39.07  
##  3rd Qu.:32.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:40.00  
##  Max.   :45.00   Max.   :5.0000   Max.   :1.00000   Max.   :43.00  
##       Peso        Lunghezza       Cranio       Tipo.parto       
##  Min.   :1500   Min.   :315   Min.   :275.0   Length:2449       
##  1st Qu.:3000   1st Qu.:480   1st Qu.:330.0   Class :character  
##  Median :3300   Median :500   Median :340.0   Mode  :character  
##  Mean   :3305   Mean   :496   Mean   :340.6                     
##  3rd Qu.:3620   3rd Qu.:510   3rd Qu.:350.0                     
##  Max.   :4930   Max.   :565   Max.   :390.0                     
##    Ospedale            Sesso          
##  Length:2449        Length:2449       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
## 'data.frame':    2449 obs. of  10 variables:
##  $ Anni.madre  : int  26 21 34 28 20 32 26 25 22 23 ...
##  $ N.gravidanze: int  0 2 3 1 0 0 1 0 1 0 ...
##  $ Fumatrici   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gestazione  : int  42 39 38 41 38 40 39 40 40 41 ...
##  $ Peso        : int  3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
##  $ Lunghezza   : int  490 490 500 515 480 495 480 510 500 510 ...
##  $ Cranio      : int  325 345 375 365 335 340 345 349 335 362 ...
##  $ Tipo.parto  : chr  "Nat" "Nat" "Nat" "Nat" ...
##  $ Ospedale    : chr  "osp3" "osp1" "osp2" "osp2" ...
##  $ Sesso       : chr  "M" "F" "M" "M" ...

The cleaning process was not based on outlier detection. Instead, we used a set of assumptions to filter the original dataset:

  1. Anni.madre within a plausible maternal age range (13 to 45 in the cleaned data) | Very young or very old mothers are unusual, and the model may not be applicable to these cases.
  2. Peso >= 1500 | Only a certain range of newborn weights has been considered for the analysis.
  3. N.gravidanze <= 5 | More than 5 pregnancies is very unusual and can distort the model.

The resulting dataframe [neonati_clean] is therefore focused on the more representative situations.
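The filtering rules can be sketched as a single `subset()` call. This is an illustrative example on a few made-up rows, not the original code; in particular the age bounds `age_min` and `age_max` are assumptions taken from the range visible in the summary output (13 to 45), since the exact thresholds used are not shown.

```r
# Made-up rows covering each exclusion rule
toy <- data.frame(
  Anni.madre   = c(0, 26, 34, 46, 30, 22),
  Peso         = c(3300, 3380, 1200, 3500, 3640, 3150),
  N.gravidanze = c(1, 0, 2, 1, 8, 2)
)

# Assumed age bounds (inferred from the cleaned summary, not confirmed)
age_min <- 13
age_max <- 45

# Keep only rows that satisfy all three rules
toy_clean <- subset(
  toy,
  Anni.madre >= age_min & Anni.madre <= age_max &
    Peso >= 1500 &
    N.gravidanze <= 5
)

dim(toy_clean)
```

Comparing `dim()` before and after filtering shows how many observations each set of rules removes.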

Hypothesis testing

The significance level (alpha) is always set to 0.05 (5%).

Hypothesis 1:

H0: The proportion of cesarean deliveries is the same across hospitals.

H1: At least one hospital differs in its cesarean rate.

## 
##  Pearson's Chi-squared test
## 
## data:  tab_osp_parto
## X-squared = 0.95076, df = 2, p-value = 0.6216

The p-value is > 0.05, so there is no statistical evidence to reject the equality of the proportions of cesarean sections. In other words, based on the available data, the 3 hospitals do not show significant differences in the frequency of cesarean sections.
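A test of this form can be reproduced with `chisq.test()` on a hospital by delivery-type contingency table. The sketch below uses synthetic data; the label "Ces" for cesarean deliveries is an assumption, since only "Nat" is visible in the excerpt.

```r
# Synthetic data: delivery type drawn independently of hospital
set.seed(42)
toy <- data.frame(
  Ospedale   = sample(c("osp1", "osp2", "osp3"), 300, replace = TRUE),
  Tipo.parto = sample(c("Ces", "Nat"), 300, replace = TRUE, prob = c(0.3, 0.7))
)

# 3 x 2 contingency table: hospitals in rows, delivery types in columns
tab_osp_parto <- table(toy$Ospedale, toy$Tipo.parto)

# Pearson's chi-squared test of independence
res <- chisq.test(tab_osp_parto)
res$parameter  # degrees of freedom: (3 - 1) * (2 - 1) = 2
res$p.value
```

The degrees of freedom match the `df = 2` reported above, because the table has 3 rows and 2 columns.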

Hypothesis 2:

H0: The population mean of Peso equals 3200 g, and the population mean of Lunghezza equals 50 cm (each tested separately).

H1: The population mean differs from the reference value.

## 
##  One Sample t-test
## 
## data:  neonati_clean$Peso
## t = 10.624, df = 2448, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
##  3285.235 3323.820
## sample estimates:
## mean of x 
##  3304.527
## 
##  One Sample t-test
## 
## data:  neonati_clean$Lunghezza
## t = 945.76, df = 2448, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  495.0602 496.9096
## sample estimates:
## mean of x 
##  495.9849

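In both tests the p-value is < 0.05, so the sample means differ significantly from the hypothesized values. On average, newborns in the dataset therefore do not match the reference values of 3200 g and 50 cm, with possible clinical implications. Note that Lunghezza is recorded in millimetres, so 50 cm corresponds to 500 mm, whereas the test above was run against 50; the 95% confidence interval for the mean length (495.06 to 496.91 mm) nevertheless excludes 500, so the conclusion still holds.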
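As a minimal sketch, one-sample tests like these can be run with `t.test(x, mu = ...)`. The vectors below are synthetic, and the length reference is written as 500 on the assumption that lengths are stored in millimetres, as the values in the data excerpt suggest.

```r
# Synthetic measurements (not the real data)
set.seed(7)
peso         <- rnorm(200, mean = 3300, sd = 500)  # grams
lunghezza_mm <- rnorm(200, mean = 496,  sd = 25)   # millimetres

# Two-sided one-sample t-tests against the reference values
t_peso <- t.test(peso, mu = 3200)
t_lung <- t.test(lunghezza_mm, mu = 500)  # 50 cm expressed in mm

t_peso$conf.int
t_lung$conf.int
```

Keeping `mu` in the same unit as the data is what makes the t statistic and confidence interval directly interpretable.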

Hypothesis 3:

H0: The mean of Peso (and, separately, of Lunghezza) is the same for M and F.

H1: The mean differs between M and F.

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.188, df = 2446.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -270.3458 -195.4077
## sample estimates:
## mean in group F mean in group M 
##        3188.517        3421.393
## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.9083, df = 2434.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -10.975379  -7.348861
## sample estimates:
## mean in group F mean in group M 
##        491.4207        500.5828

In both tests the p-value is < 0.05, so there is a statistically significant difference in weight and length between males and females. This is in line with the literature documenting slight anthropometric differences by sex at birth.
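The two-sample comparison can be sketched with the formula interface of `t.test()`, which applies the Welch correction by default (`var.equal = FALSE`). The data below are synthetic and only mimic the structure of the Peso by Sesso comparison.

```r
# Synthetic weights for two groups of 100 newborns each
set.seed(3)
toy <- data.frame(
  Sesso = rep(c("F", "M"), each = 100),
  Peso  = c(rnorm(100, 3190, 500), rnorm(100, 3420, 500))
)

# Welch two-sample t-test: does mean Peso differ between F and M?
welch <- t.test(Peso ~ Sesso, data = toy)

welch$method    # confirms which variant of the test was run
welch$estimate  # group means
welch$conf.int  # CI for the difference (F minus M)
```

The sign of the interval follows the factor level order, so a negative interval means that group F has the lower mean, as in the output above.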

Building a linear regression model

Linear Regression Model

## 
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione + N.gravidanze + 
##     Fumatrici + Sesso + Ospedale + Tipo.parto, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1477.66  -276.38   -18.78   260.85  1885.89 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2643.814    210.268 -12.574  < 2e-16 ***
## Anni.madre        2.177      1.740   1.251  0.21096    
## Gestazione      147.279      5.104  28.856  < 2e-16 ***
## N.gravidanze     35.192      8.545   4.118 3.94e-05 ***
## Fumatrici1     -114.589     41.197  -2.781  0.00545 ** 
## SessoM          167.497     16.645  10.063  < 2e-16 ***
## Ospedaleosp2      2.022     20.239   0.100  0.92044    
## Ospedaleosp3     18.354     20.296   0.904  0.36593    
## Tipo.partoNat    22.767     18.120   1.256  0.20907    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2440 degrees of freedom
## Multiple R-squared:  0.3004, Adjusted R-squared:  0.2981 
## F-statistic: 130.9 on 8 and 2440 DF,  p-value: < 2.2e-16

Regression Model with interactions

## 
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione * Sesso + N.gravidanze + 
##     Fumatrici + Ospedale + Tipo.parto, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1474.06  -276.47   -19.91   261.13  1887.67 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2507.180    274.584  -9.131  < 2e-16 ***
## Anni.madre            2.181      1.740   1.254  0.21001    
## Gestazione          143.745      6.850  20.986  < 2e-16 ***
## SessoM             -139.684    397.314  -0.352  0.72519    
## N.gravidanze         35.143      8.546   4.112 4.05e-05 ***
## Fumatrici1         -115.603     41.222  -2.804  0.00508 ** 
## Ospedaleosp2          2.340     20.245   0.116  0.90798    
## Ospedaleosp3         18.990     20.315   0.935  0.34998    
## Tipo.partoNat        23.191     18.130   1.279  0.20097    
## Gestazione:SessoM     7.859     10.155   0.774  0.43911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2439 degrees of freedom
## Multiple R-squared:  0.3005, Adjusted R-squared:  0.298 
## F-statistic: 116.4 on 9 and 2439 DF,  p-value: < 2.2e-16

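Interactions: linear_model_interactions adds the Gestazione * Sesso term, testing whether the effect of gestation length on weight varies by sex. Here the Gestazione:SessoM coefficient is not significant (p = 0.439), so there is no evidence that it does.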
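To illustrate how such an interaction is specified and checked, the sketch below fits an additive and an interaction model on synthetic data and compares them with a nested F-test. The names `m_add` and `m_int` are hypothetical, and the data are generated without any true interaction.

```r
# Synthetic data with additive effects only
set.seed(11)
n <- 300
toy <- data.frame(
  Gestazione = sample(35:42, n, replace = TRUE),
  Sesso      = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$Peso <- -2500 + 145 * toy$Gestazione +
  170 * (toy$Sesso == "M") + rnorm(n, sd = 400)

# Gestazione * Sesso expands to Gestazione + Sesso + Gestazione:Sesso
m_add <- lm(Peso ~ Gestazione + Sesso, data = toy)
m_int <- lm(Peso ~ Gestazione * Sesso, data = toy)

# Nested-model F-test for the single extra interaction coefficient
anova(m_add, m_int)
```

With one extra coefficient, the F-test here is equivalent to the t-test on `Gestazione:SessoM` in the model summary.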

Stepwise reduction (both directions)

## Start:  AIC=29451.11
## Peso ~ Anni.madre + Gestazione + N.gravidanze + Fumatrici + Sesso + 
##     Ospedale + Tipo.parto
## 
##                Df Sum of Sq       RSS   AIC
## - Ospedale      2    164928 406161909 29448
## - Anni.madre    1    260506 406257487 29451
## - Tipo.parto    1    262678 406259659 29451
## <none>                      405996981 29451
## - Fumatrici     1   1287312 407284293 29457
## - N.gravidanze  1   2822268 408819249 29466
## - Sesso         1  16850123 422847104 29549
## - Gestazione    1 138549431 544546412 30168
## 
## Step:  AIC=29448.11
## Peso ~ Anni.madre + Gestazione + N.gravidanze + Fumatrici + Sesso + 
##     Tipo.parto
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1    269445 406431354 29448
## - Tipo.parto    1    271901 406433810 29448
## <none>                      406161909 29448
## + Ospedale      2    164928 405996981 29451
## - Fumatrici     1   1305624 407467533 29454
## - N.gravidanze  1   2871020 409032929 29463
## - Sesso         1  16873019 423034928 29546
## - Gestazione    1 138817827 544979736 30166
## 
## Step:  AIC=29447.73
## Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso + Tipo.parto
## 
##                Df Sum of Sq       RSS   AIC
## - Tipo.parto    1    278357 406709711 29447
## <none>                      406431354 29448
## + Anni.madre    1    269445 406161909 29448
## + Ospedale      2    173867 406257487 29451
## - Fumatrici     1   1327235 407758589 29454
## - N.gravidanze  1   4221000 410652354 29471
## - Sesso         1  16949646 423381000 29546
## - Gestazione    1 139220821 545652175 30167
## 
## Step:  AIC=29447.41
## Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      406709711 29447
## + Tipo.parto    1    278357 406431354 29448
## + Anni.madre    1    275901 406433810 29448
## + Ospedale      2    183531 406526181 29450
## - Fumatrici     1   1301092 408010803 29453
## - N.gravidanze  1   4132307 410842018 29470
## - Sesso         1  16936415 423646126 29545
## - Gestazione    1 139157084 545866796 30166
## 
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso, 
##     data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1477.05  -274.13   -19.28   261.66  1864.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2537.500    198.084 -12.810  < 2e-16 ***
## Gestazione     146.614      5.070  28.918  < 2e-16 ***
## N.gravidanze    39.187      7.864   4.983 6.69e-07 ***
## Fumatrici1    -115.132     41.175  -2.796  0.00521 ** 
## SessoM         167.892     16.642  10.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared:  0.2991, Adjusted R-squared:  0.298 
## F-statistic: 260.8 on 4 and 2444 DF,  p-value: < 2.2e-16

Stepwise model: variables are removed (or added back) one at a time whenever doing so lowers the AIC, which simplifies the model. Here Ospedale, Anni.madre and Tipo.parto are dropped, in that order, leaving a more parsimonious model with Gestazione, N.gravidanze, Fumatrici and Sesso.
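A bidirectional search of this kind can be sketched with base R's `step()`. The example uses synthetic data in which `Anni.madre` has no real effect, so it is a candidate for removal; it is an illustration of the technique rather than the report's own call.

```r
# Synthetic data: Peso depends on Gestazione and Sesso, not on Anni.madre
set.seed(5)
n <- 400
toy <- data.frame(
  Gestazione = sample(35:42, n, replace = TRUE),
  Anni.madre = sample(18:42, n, replace = TRUE),
  Sesso      = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$Peso <- -2500 + 145 * toy$Gestazione +
  170 * (toy$Sesso == "M") + rnorm(n, sd = 400)

full <- lm(Peso ~ Gestazione + Anni.madre + Sesso, data = toy)

# direction = "both": at each step a term may be dropped or re-added,
# and the move with the lowest AIC is taken; trace = 0 silences the log
reduced <- step(full, direction = "both", trace = 0)

formula(reduced)
```

Setting `trace = 1` instead prints a step-by-step table like the one shown above.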

Multiple linear regression model with interactions

Full Linear model

## 
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione + N.gravidanze + 
##     Fumatrici + Sesso + Ospedale + Tipo.parto, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1477.66  -276.38   -18.78   260.85  1885.89 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2643.814    210.268 -12.574  < 2e-16 ***
## Anni.madre        2.177      1.740   1.251  0.21096    
## Gestazione      147.279      5.104  28.856  < 2e-16 ***
## N.gravidanze     35.192      8.545   4.118 3.94e-05 ***
## Fumatrici1     -114.589     41.197  -2.781  0.00545 ** 
## SessoM          167.497     16.645  10.063  < 2e-16 ***
## Ospedaleosp2      2.022     20.239   0.100  0.92044    
## Ospedaleosp3     18.354     20.296   0.904  0.36593    
## Tipo.partoNat    22.767     18.120   1.256  0.20907    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2440 degrees of freedom
## Multiple R-squared:  0.3004, Adjusted R-squared:  0.2981 
## F-statistic: 130.9 on 8 and 2440 DF,  p-value: < 2.2e-16

Full Linear model with interactions

## 
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione * Sesso + N.gravidanze + 
##     Fumatrici + Ospedale + Tipo.parto, data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1474.06  -276.47   -19.91   261.13  1887.67 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2507.180    274.584  -9.131  < 2e-16 ***
## Anni.madre            2.181      1.740   1.254  0.21001    
## Gestazione          143.745      6.850  20.986  < 2e-16 ***
## SessoM             -139.684    397.314  -0.352  0.72519    
## N.gravidanze         35.143      8.546   4.112 4.05e-05 ***
## Fumatrici1         -115.603     41.222  -2.804  0.00508 ** 
## Ospedaleosp2          2.340     20.245   0.116  0.90798    
## Ospedaleosp3         18.990     20.315   0.935  0.34998    
## Tipo.partoNat        23.191     18.130   1.279  0.20097    
## Gestazione:SessoM     7.859     10.155   0.774  0.43911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2439 degrees of freedom
## Multiple R-squared:  0.3005, Adjusted R-squared:  0.298 
## F-statistic: 116.4 on 9 and 2439 DF,  p-value: < 2.2e-16

AIC/BIC Features Selection

We use two feature selection procedures to choose a subset of the original features in the dataset. The candidate models are:

• initial_model (modello_iniziale): all the variables plus the Gestazione:Sesso interaction.

• AIC_model (modello_AIC): selected by AIC, which penalizes complexity less than BIC and so often retains more variables.

• BIC_model (modello_BIC): selected by BIC, which penalizes complexity more strongly and favors simpler models. Preferred for interpretability.

## 
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso, 
##     data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1477.05  -274.13   -19.28   261.66  1864.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2537.500    198.084 -12.810  < 2e-16 ***
## Gestazione     146.614      5.070  28.918  < 2e-16 ***
## N.gravidanze    39.187      7.864   4.983 6.69e-07 ***
## Fumatrici1    -115.132     41.175  -2.796  0.00521 ** 
## SessoM         167.892     16.642  10.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared:  0.2991, Adjusted R-squared:  0.298 
## F-statistic: 260.8 on 4 and 2444 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso, 
##     data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1477.05  -274.13   -19.28   261.66  1864.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2537.500    198.084 -12.810  < 2e-16 ***
## Gestazione     146.614      5.070  28.918  < 2e-16 ***
## N.gravidanze    39.187      7.864   4.983 6.69e-07 ***
## Fumatrici1    -115.132     41.175  -2.796  0.00521 ** 
## SessoM         167.892     16.642  10.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared:  0.2991, Adjusted R-squared:  0.298 
## F-statistic: 260.8 on 4 and 2444 DF,  p-value: < 2.2e-16
##                  df      AIC
## modello_iniziale 11 36404.47
## modello_AIC       6 36399.37
## modello_BIC       6 36399.37
##                  df      BIC
## modello_iniziale 11 36468.31
## modello_AIC       6 36434.19
## modello_BIC       6 36434.19

CONCLUSIONS:

In the project we evaluated the regression models by comparing the AIC and BIC values. In summary:

  1. We start from an initial model (complete with all variables and, if desired, interactions).

  2. stepAIC() is applied to minimize the AIC, obtaining a model (modello_AIC) in which variables are removed or added according to the Akaike Information Criterion.

  3. stepAIC(…, k = log(n)) is applied to minimize the BIC, obtaining a model (modello_BIC) that tends to be more parsimonious, because the BIC (Bayesian Information Criterion) penalizes the number of parameters more heavily.
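The three steps above can be sketched as follows. For a dependency-free example this uses base R's `step()`, which takes the same `k` penalty argument as `MASS::stepAIC()`; the data and the model names `m_aic` and `m_bic` are illustrative assumptions.

```r
# Synthetic data with two irrelevant predictors
set.seed(8)
n <- 400
toy <- data.frame(
  Gestazione = sample(35:42, n, replace = TRUE),
  Anni.madre = sample(18:42, n, replace = TRUE),
  Rumore     = rnorm(n)  # hypothetical pure-noise variable
)
toy$Peso <- -2500 + 145 * toy$Gestazione + rnorm(n, sd = 400)

# Step 1: start from the full model
full <- lm(Peso ~ Gestazione + Anni.madre + Rumore, data = toy)

# Step 2: default penalty k = 2 gives AIC-based selection
m_aic <- step(full, direction = "both", trace = 0)

# Step 3: penalty k = log(n) gives BIC-based selection
m_bic <- step(full, direction = "both", trace = 0, k = log(n))

# Side-by-side information criteria, as in the tables above
AIC(full, m_aic, m_bic)
BIC(full, m_aic, m_bic)
```

Since log(n) exceeds 2 for any n above 7, the BIC penalty per parameter is the larger one on a dataset of this size.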

To compare the models and choose the best one, we considered:

  1. Fit index: in particular R^2 (or Adjusted R^2) and the standard error of the residuals.

  2. Information criteria: AIC and BIC (penalize the complexity of the model).

  3. Simplicity and interpretability: a model with fewer variables can be preferred, if the loss of accuracy is limited.

Conclusions on the “Best” model:

  1. Accuracy: Adj. R^2 is approximately 0.298 in all models, a sign that the explanatory power changes little between the full and the reduced version.

  2. Complexity penalty: AIC and BIC support the reduced version, in which 4 variables remain (Gestazione, N.gravidanze, Fumatrici, Sesso).

  3. Interpretability: the reduced model is more parsimonious (fewer coefficients) and maintains practically the same explanatory power.

In practice:

• The stepwise, AIC and BIC models coincide: AIC = 36399.37, BIC = 36434.19.

• The full model and the interaction model have a similar Adj. R^2 but include more non-significant variables and have a slightly higher AIC (36403.07 and 36404.47).

Model performance comparison

Comparison of regression models
Model              AIC       BIC       R2   Adj_R2  RMSE
Full Model         36403.07  36461.11  0.3  0.3     407.16
Interaction Model  36404.47  36468.31  0.3  0.3     407.11
Stepwise           36399.37  36434.19  0.3  0.3     407.52
AIC Model          36399.37  36434.19  0.3  0.3     407.52
BIC Model          36399.37  36434.19  0.3  0.3     407.52
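A comparison table with these columns could be assembled as in the sketch below. `rmse` and `compare_models` are hypothetical helpers and the data are synthetic; RMSE is computed here as the square root of the mean squared residual, which is an assumption about how the reported values were obtained.

```r
# Hypothetical helpers for the comparison table
rmse <- function(m) sqrt(mean(residuals(m)^2))

compare_models <- function(models) {
  data.frame(
    Model  = names(models),
    AIC    = sapply(models, AIC),
    BIC    = sapply(models, BIC),
    R2     = sapply(models, function(m) summary(m)$r.squared),
    Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
    RMSE   = sapply(models, rmse),
    row.names = NULL
  )
}

# Synthetic data and two nested candidate models
set.seed(9)
n <- 300
toy <- data.frame(
  Gestazione = sample(35:42, n, replace = TRUE),
  Anni.madre = sample(18:42, n, replace = TRUE)
)
toy$Peso <- -2500 + 145 * toy$Gestazione + rnorm(n, sd = 400)

models <- list(
  Reduced = lm(Peso ~ Gestazione, data = toy),
  Full    = lm(Peso ~ Gestazione + Anni.madre, data = toy)
)
cmp <- compare_models(models)
print(cmp)
```

Because RMSE and R2 can only improve when terms are added to a nested model, the penalized criteria (AIC, BIC, Adj_R2) are the columns that discriminate between candidates.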

Model Quality Assessment

## 
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso, 
##     data = neonati_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1477.05  -274.13   -19.28   261.66  1864.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2537.500    198.084 -12.810  < 2e-16 ***
## Gestazione     146.614      5.070  28.918  < 2e-16 ***
## N.gravidanze    39.187      7.864   4.983 6.69e-07 ***
## Fumatrici1    -115.132     41.175  -2.796  0.00521 ** 
## SessoM         167.892     16.642  10.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared:  0.2991, Adjusted R-squared:  0.298 
## F-statistic: 260.8 on 4 and 2444 DF,  p-value: < 2.2e-16
## [1] 407.519

##         StudRes         Hat        CookD
## 310  -0.3100152 0.019541517 0.0003832531
## 1385 -0.2330505 0.016488238 0.0001821767
## 1553  4.3665777 0.006150757 0.0234272751
## 1920  4.6147906 0.010720106 0.0457743482
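Influence measures with these three columns can be computed in base R with `rstudent()`, `hatvalues()` and `cooks.distance()`. The sketch below is an assumption about the approach rather than the report's code; it plants one artificial outlier in synthetic data to show how it surfaces.

```r
# Synthetic data with one planted outlier in row 10
set.seed(21)
n <- 200
toy <- data.frame(Gestazione = sample(35:42, n, replace = TRUE))
toy$Peso <- -2500 + 145 * toy$Gestazione + rnorm(n, sd = 400)
toy$Peso[10] <- toy$Peso[10] + 3000

fit <- lm(Peso ~ Gestazione, data = toy)

# Per-observation diagnostics
diagnostics <- data.frame(
  StudRes = rstudent(fit),        # externally studentized residuals
  Hat     = hatvalues(fit),       # leverage
  CookD   = cooks.distance(fit)   # overall influence on the fit
)

# Observations with the largest absolute studentized residuals
head(diagnostics[order(-abs(diagnostics$StudRes)), ], 3)
```

In the output above, observations 1553 and 1920 have studentized residuals above 4 but small Cook's distances, which suggests unusual values with limited influence on the coefficients.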

Example data test

The selected model is tested in a hypothetical scenario: Gestazione: 39 weeks, N.gravidanze: 3, Fumatrici: non-smoker (“0”), Sesso: female (“F”).
Prediction of newborn weight using the selected model (modello_BIC)
Scenario                                                         Predicted_Weight  Lower_PI  Upper_PI
Mother on 3rd pregnancy, 39 weeks gestation, non-smoker, female  3298              2497.1    4098.9
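The point prediction can be checked by hand from the coefficients of the selected model, and a prediction interval can be obtained with `predict(..., interval = "prediction")`. The sketch below does the hand calculation with the reported coefficients, then shows the `predict()` call on a model fitted to synthetic data (so its numbers will differ from the table).

```r
# Hand calculation with the reported coefficients; the Fumatrici1 and
# SessoM terms drop out for a non-smoking mother and a female newborn
hand <- -2537.500 + 146.614 * 39 + 39.187 * 3
round(hand)  # about 3298, matching the table

# Synthetic data with the same predictors as the selected model
set.seed(13)
n <- 300
toy <- data.frame(
  Gestazione   = sample(35:42, n, replace = TRUE),
  N.gravidanze = sample(0:4, n, replace = TRUE),
  Fumatrici    = factor(rep(c("0", "1"), times = c(270, 30))),
  Sesso        = factor(rep(c("F", "M"), length.out = n))
)
toy$Peso <- -2500 + 145 * toy$Gestazione + 40 * toy$N.gravidanze -
  115 * (toy$Fumatrici == "1") + 170 * (toy$Sesso == "M") +
  rnorm(n, sd = 400)

fit <- lm(Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso, data = toy)

# New case: factor levels must match those used in the fit
new_case <- data.frame(
  Gestazione   = 39,
  N.gravidanze = 3,
  Fumatrici    = factor("0", levels = c("0", "1")),
  Sesso        = factor("F", levels = c("F", "M"))
)

# 95% prediction interval for a single new observation
pred <- predict(fit, newdata = new_case, interval = "prediction", level = 0.95)
pred
```

A prediction interval is wider than a confidence interval for the mean because it also accounts for the residual variability of an individual newborn.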