Preliminary Analysis

Dataset presentation

The dataset presents 2500 observations, for 10 variables:
* Anni.madre: continuous quantitative variable ;
* N.gravidanze: discrete quantitative variable ;
* Fumatrice: Factor/dummy variable: “0” means non-smoker while “1” means smoker ;
* Gestazione: continuous quantitative variable ;
* Peso: continuous quantitative variable ;
* Lunghezza: continuous quantitative variable ;
* Cranio: continuous quantitative variable ;
* Tipo.parto: qualitative variable on nominal scale with 2 modalities: “Ces” and “Nat” ;
* Ospedale: qualitative variable on nominal scale with 3 modalities: “osp1”. “osp2” and “osp3” ;
* Sesso: qualitative variable on nominal scale with 2 modalities: “F” and “M”.

In this study, “Peso” is the response variable, while the others are predictor variables.
In particular “Lunghezza” and “Cranio” are anthropometric data, used as control variables.

Description of quantitative variables

data <- read.csv("neonati.csv", sep=",", stringsAsFactors = T)
n<-dim(data)[1]
attach(data)

matrix_data <- matrix(nrow = 6, ncol = 6, byrow=F)

for (j in 1:6) {
  matrix_data[,j] <- c(summary(Anni.madre)[j], summary(N.gravidanze)[j], summary(Gestazione)[j], summary(Peso)[j], summary(Lunghezza)[j], summary(Cranio)[j])
  }

library(DT) 
datatable(round(matrix_data, digits =2), 
  colnames = c("Min", "1st. Quart", "Median", "Mean", "3rd Quart", "Max"),
  rownames = c("Anni.madre", "N.gravidanze", "Gestazione", "Peso", "Lunghezza", "Cranio")
)

The first variable, mother’s age, presents a minimum at 0 years old. To deeply investigate, a plot is done, in order to highlight eventual outliers/anomalies on low mother’s age value.

library(ggplot2)
ggplot(data,aes(x=1:n,y=Anni.madre,label=ifelse(Anni.madre<(quantile(Anni.madre, 0.25) - 1.5 * IQR(Anni.madre)), Anni.madre ,"")))+
  geom_point(size=1,color = ifelse(Anni.madre < (quantile(Anni.madre, 0.25) - 1.5 * IQR(Anni.madre)), "red", "black"))+
  geom_text(aes(colour="red"), vjust = 1.5, show.legend=F)+
  labs(title="Mother's age with highlight on abnormal values", x="", y="Mother's age")+
  theme_light()

It appears that among the 5 values identified as outliers, 2 are errors with ages of 0 and 1 year old.

These statistic units are removed from that database, as they can’t correspond to true values.

new_data <- data[ !(Anni.madre %in% c(0, 1)), ]
n<-dim(new_data)[1]
detach(data)
attach(new_data)

Going back to the recapitulative table of quantitative variables, there are no abnormal values in the min and max. To fully describe them, some plots are performed.

Number of previous pregnancies

fi <- table(N.gravidanze)/n*100

barplot(fi, 
        main ="Number of previous pregnancies" ,
        xlab ="Number of pregnancies" ,
        ylab ="Relative frequencies, in %",
        ylim = c(0,50),
        cex.axis = 0.7,
        cex.names= 0.7,
        col= "mediumaquamarine")

For more than 40% of the mothers, it is a first pregnancy. The mother with major previous pregnancies had 12.
The mean value is 0.98 pregnancy, very close value to 1. So, on average it is the second pregnancy.

Number of weeks of gestation

fi <- table(Gestazione)/n*100

barplot(fi, 
        main ="Number of weeks of gestation" ,
        xlab ="Number of weeks" ,
        ylab ="Relative frequencies, in %",
        ylim = c(0,40),
        cex.axis = 0.7,
        cex.names= 0.7,
        col= "mediumaquamarine")

For nearly 30% of the births, it happens at the 40th week, which corresponds to the theoretical due date of the gestation. According to the summary table, the mean and median values are equal to 39 weeks. The minimum is 25 weeks, the maximum is 43 weeks. There is no abnormal value here.

Weight, length and cranium size

For these 3 continuous quantitative variables, we plot here their densities. As many variables found in nature, we expect Gaussian distributions.

par(mfrow=c(1,3))

plot(density(Peso),
     main = "Density of Weight",
     xlab = "Weight, g",
     ylab = "Density",
     xlim = c(min(Peso), max(Peso)))
abline(v=median(Peso), col=2)
text(x=median(Peso)+600, y=0.0008, labels = median(Peso), col=2, cex = 1.3)

plot(density(Lunghezza),
     main = "Density of Length",
     xlab = "Length, cm",
     ylab = "Density",
     xlim = c(min(Lunghezza), max(Lunghezza)))
abline(v=median(Lunghezza), col=2)
text(x=median(Lunghezza)+30, y=0.02, labels = median(Lunghezza), col=2, cex = 1.3)

plot(density(Cranio),
     main = "Density of Cranium diameter",
     xlab = "Cranium diameter, mm",
     ylab = "Density",
     xlim = c(min(Cranio), max(Cranio)))
abline(v=median(Cranio), col=2)
text(x=median(Cranio)+25, y=0.026, labels = median(Cranio), col=2, cex = 1.3)

Shape factors:

library(moments)
matrix_shape <- round(matrix(c
                       (skewness(Peso), skewness(Lunghezza), skewness(Cranio), 
                         kurtosis(Peso)-3, kurtosis(Lunghezza)-3, kurtosis(Cranio)-3),
nrow = 3, ncol = 2, byrow=F), 2)

datatable(matrix_shape, 
  colnames = c("Skewness", "Kurtosis"),
  rownames = c("Weight", "Length", "Cranium diameter"))

Compared with Gaussian distributions, these 3 ones have negative skewnesses and are leptokurtic.

Description of qualitative variables

Smoking

fi_S <- round(table(Fumatrici)/n, 2)*100
print(fi_S)
## Fumatrici
##  0  1 
## 96  4

In this survey, 96% of the mothers are non-smoking, while 4% smoke.

Type of delivery

fi_D <- round(table(Tipo.parto)/n, 2)*100
print(fi_D)
## Tipo.parto
## Ces Nat 
##  29  71

71% of the mothers had a natural delivery, while 29% a C-section.

Hospital

fi_H <- round(table(Ospedale)/n, 2)*100
print(fi_H)
## Ospedale
## osp1 osp2 osp3 
##   33   34   33

There is absolute equality: the survey considered the same number of mothers in each of the 3 hospitals.

Newborn sex

fi_Sex <- round(table(Sesso)/n, 2)*100
print(fi_Sex)
## Sesso
##  F  M 
## 50 50

There is equality between girls and boys.

Hypothesis to test

Cesarean delivery is performed more often in some hospital

We want to know if the 2 qualitative variables “Ospedale” and “Tipo.parto” are independent. A Chi-squared test is performed.

table_Osp_Par <- table(Ospedale, Tipo.parto)
test_Osp_Par <- chisq.test(table_Osp_Par)
print(test_Osp_Par)
## 
##  Pearson's Chi-squared test
## 
## data:  table_Osp_Par
## X-squared = 1.083, df = 2, p-value = 0.5819

p-value is equal to 0.58, much higher than 5%, so we can’t refuse H0 hypothesis: there is independency of these 2 variables, so there is no hospital with major numbers of C-sections.

Weight and length means compared to population values

According to the Stanford Medicine Children’s Health website (https://www.stanfordchildrens.org/en/topic/default?id=newborn-measurements-90-P02673), the population mean values are:
* Mean weight: 3.200kg
* Mean length: 50 cm

mean(Peso);mean(Lunghezza)
## [1] 3284.184
## [1] 494.6958

To test if weight and length values of our sample are significantly different from the population ones, we perform a t.test.

Weight
t.test(Peso, mu=3200, conf.level=0.95, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Peso
## t = 8.0108, df = 2497, p-value = 1.731e-15
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
##  3263.577 3304.791
## sample estimates:
## mean of x 
##  3284.184

Even if population mean (3200 g) and sample mean (3284g) are close values, the t-test suggests that these data are significantly different (p-value very close to 0, the H0 hypothesis of equality is rejected).

Length
t.test(Lunghezza, mu=500, conf.level=0.95, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Lunghezza
## t = -10.069, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
##  493.6628 495.7287
## sample estimates:
## mean of x 
##  494.6958

In the same way, the t-test here establishs a relevant difference between population mean (500 cm) and sample mean (495 cm), with a significativity level of 5%.

Significative difference of anthropometric data between male and female

Cranium diameter
library(RColorBrewer)
ggplot(new_data)+
  geom_boxplot(aes(x = Sesso, y = Cranio, fill = Sesso))+
  labs(title = "Cranium diameter for M vs F",
       x = " ",
       y = "Cranium diameter, mm")+
  theme_light()+
  scale_fill_brewer(palette="Pastel1")

A t-test to compare group mean values is performed.

t.test(data = new_data, Cranio~Sesso)
## 
##  Welch Two Sample t-test
## 
## data:  Cranio by Sesso
## t = -7.4366, df = 2489.4, p-value = 1.414e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -6.110504 -3.560417
## sample estimates:
## mean in group F mean in group M 
##        337.6231        342.4586

The p-value is very low, so H0 hypothesis is rejected: there is relevant difference of cranium data between male and female.

Length
ggplot(new_data)+
  geom_boxplot(aes(x = Sesso, y = Lunghezza, fill = Sesso))+
  labs(title = "Length for M vs F",
       x = " ",
       y = "Length, mm")+
  theme_light()+
  scale_fill_brewer(palette="Pastel1")

A t-test to compare group mean values is performed.

t.test(data = new_data, Lunghezza~Sesso)
## 
##  Welch Two Sample t-test
## 
## data:  Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.939001  -7.882672
## sample estimates:
## mean in group F mean in group M 
##        489.7641        499.6750

As a conclusion, it appears that both cranium diameter and length are significantly different in respect of the newborn sex.

Regression Model Creation

As preamble, we build a correlation matrix, together with the scatterplots of variables per pairs.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(new_data, upper.panel = panel.smooth, lower.panel = panel.cor)

First, we observe the correlations between the response variable weight and the other quantitative variables:

Weight and mother age

cor(Peso, Anni.madre)
## [1] -0.02378138

As the correlation coefficient is very close to 0, we can say that, based on these sample data, there is no correlation between weight and mother age. Nevertheless, as age is a control variable, and is usual to be considered as very important about pregnancy, at first, it will be kept in the regression model.

Weight and number of previous pregnancies

cor(Peso, N.gravidanze)
## [1] 0.002277118

With this very low value, it can be said that newborn’s weight and number of previous pregnancies are not correlated variables.

Weight and gestation duration

cor(Peso, Gestazione)
## [1] 0.5919592

With a correlation coefficient of nearly 0.6, it seems clear that these 2 variables are correlated: the newborn weight is higher if it stays longer in his/her mother’s womb.

A plot is done to visualize this interaction.

ggplot(new_data)+
  geom_point(aes(x=Gestazione, y=Peso), position = "jitter", size = 0.5)+
  geom_smooth(aes(x=Gestazione, y=Peso))+
  labs(title= "Weight = f(Weeks of Gestation)",
       x = "Weeks of gestation",
       y = "Weight, g")+
  theme_light()

We observe a slighlty non-linear trend here.

Weight and anthropomorphic measurements

Weight and length

cor(Peso, Lunghezza)
## [1] 0.7960415

As logically expected, the weight and length variables are highly correlated. Let’s plot this relation to investigate if it is linear or not.

ggplot(new_data)+
  geom_point(aes(x=Lunghezza, y=Peso), size = 0.5, position = "jitter")+
  geom_smooth(aes(x=Lunghezza, y=Peso))+
  labs(title= "Weight = f(Length)",
       x = "Length, mm",
       y = "Weight, g")+
  theme_light()

From around Length = 370 mm, the relation between weight and length seems to be linear. A point draws our attention:

which(Lunghezza == 315) ; Peso[1549]
## [1] 1549
## [1] 4370

It would corresponds to a newborn with atypical features: small length of 315 mm and important weight of 4.37 kg.

So far we keep this data, a further analysis with Cook distance will determine if it corresponds to an abnormal value.

Weight and cranium diameter

cor(Peso, Cranio)
## [1] 0.7048438

As for length, there is a strong correlation between weight and cranium diameter.

ggplot(new_data)+
  geom_point(aes(x=Cranio, y=Peso), size = 0.5)+
  geom_smooth(aes(x=Cranio, y=Peso))+
  labs(title= "Weight = f(Cranium diameter)",
       x = "Cranium diameter, mm",
       y = "Weight, g")+
  theme_light()

The relation doesn’t seem to be perfectly linear, next, some quadratic effect can be investigated.

Then, we study correlations between the response variable “Peso” and the qualitative variables.

Smoking mother behaviour

ggplot(new_data)+
  geom_boxplot(aes(x = factor(Fumatrici), y = Peso, fill = factor(Fumatrici)))+
  labs(title = "Weight = f(Smoking)",
       x = " ",
       y = "Weight, g")+
  theme_light()+
  scale_fill_brewer(palette="Set2")

A t-test is done in order to test the correlation between these 2 variables.

t.test(Peso~Fumatrici)
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Fumatrici
## t = 1.0362, df = 114.12, p-value = 0.3023
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -45.5076 145.3399
## sample estimates:
## mean in group 0 mean in group 1 
##        3286.262        3236.346

p-value equals to 30%: it is not possible to reject H0, so we don’t conclude about a relevant difference of weight mean in respect of the smoking mother behaviour.

Delivery type

ggplot(new_data)+
  geom_boxplot(aes(x = Tipo.parto, y = Peso, fill = Tipo.parto))+
  labs(title = "Weight = f(Delivery type)",
       x = " ",
       y = "Weight, g")+
  theme_light()+
  scale_fill_brewer(palette="Set3")

A t-test is done in order to test the correlation between these 2 variables.

t.test(Peso~Tipo.parto)
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Tipo.parto
## t = -0.13626, df = 1494.4, p-value = 0.8916
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
##  -46.44246  40.40931
## sample estimates:
## mean in group Ces mean in group Nat 
##          3282.047          3285.063

Weight means of the 2 groups divided by delivery type are very close: in fact as visible on the boxplot it doesn’t seem to be any difference, and this result is confirmed by the t-test which, with p-value = 89%, absolutly accept the hypothesis of means equality.

For this reason, type of delivery will not be considered in the regression model.

Hospital

No correlation is expected between hospital and weight, but a t-test is done to confirm this hypothesis.

pairwise.t.test(Peso, Ospedale, paired = F, pool.sd = T, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Peso and Ospedale 
## 
##      osp1 osp2
## osp2 1.00 -   
## osp3 0.33 0.32
## 
## P value adjustment method: bonferroni

p-values are >30%, we accept the hypothesis of means equality: there is no correlation between weight value and the place of birth.

This qualitative variable will not be taken in account for the model, as well as type delivery.

Newborn sex

ggplot(new_data)+
  geom_boxplot(aes(x = Sesso, y = Peso, fill = Sesso))+
  labs(title = "Weight = f(Newborn sex)",
       x = " ",
       y = "Weight, g")+
  theme_light()+
  scale_fill_brewer(palette="Pastel1")

A difference is visible between female weight and male weight. We test this hypothesis with a t-test.

t.test(Peso~Sesso)
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.115, df = 2488.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.4841 -207.3844
## sample estimates:
## mean in group F mean in group M 
##        3161.061        3408.496

As p-value is close to 0, we conclude about the relevant difference of weight means between male and female.

Regression Model Creation

Model number 1

A first model is created, with all variables except delivery type and hospital.

mod1 <- lm(Peso~Anni.madre+N.gravidanze+Fumatrici+Gestazione+Lunghezza+Cranio+Sesso, data = new_data)
summary(mod1)
## 
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Sesso, data = new_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1160.6  -181.3   -15.7   163.6  2630.7 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6712.2405   141.3339 -47.492  < 2e-16 ***
## Anni.madre       0.8803     1.1491   0.766    0.444    
## N.gravidanze    11.3789     4.6767   2.433    0.015 *  
## Fumatrici      -30.3958    27.6080  -1.101    0.271    
## Gestazione      32.9472     3.8288   8.605  < 2e-16 ***
## Lunghezza       10.2316     0.3011  33.979  < 2e-16 ***
## Cranio          10.5198     0.4271  24.633  < 2e-16 ***
## SessoM          78.0787    11.2132   6.963 4.24e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7264 
## F-statistic: 948.3 on 7 and 2490 DF,  p-value: < 2.2e-16

This model explains nearly 73% of the variability (adjusted R-squared).

As seen before, mother age and smoking habits are not correlated to the weight (p-values of 44% and 27%).

About number of previous pregnancies: p-value of around 1.5%, so can be considered as significant. Beta coefficient equals to 11.38: it means that, all other things being equal, for each previous pregnancy of the mother, the newborn weight increases of 11 g.

To visualize this interaction, hereafter the corresponding plot.

  library(dplyr)
  by_Preg <-new_data %>%
    group_by(N.gravidanze)%>%
    summarise(media = mean(Peso))

  ggplot(by_Preg, aes(x=N.gravidanze, y=media))+
  geom_col(fill="mediumaquamarine", col="black")+
  labs(title="Newborn weight vs number of previous pregnancies",
       x="Number of previous pregnancies",
       y="Weight, in g")+
  theme_light()

The interaction is not obvious: it seems that it is mainly influenced by the data obtained for mothers with 7 previous pregnancies.

table(N.gravidanze)
## N.gravidanze
##    0    1    2    3    4    5    6    7    8    9   10   11   12 
## 1095  817  340  150   48   21   11    1    8    2    3    1    1

With 7 previous pregnancies there is an only one mother. So we decide to remove this variable in next model.

About the others variables:

  • Number of weeks of gestation: with a p-value close to 0, the result is very significant. The beta coefficient is about 33, it means that for each week more of gestation, the newborn gains 33g.

  • Anthropomorphic data: both cranium diameter and length present a p-value close to 0, so a very relevant piece of information. Both have beta coefficients of around 10: it means that for each mm more of cranium diameter or of length corresponds an increase of 10 g for the newborn.

  • Newborn sex: with a very low p-value, sex is a relevant variable here. The beta coefficient of nearly 78 indicates that meanly, the male newborns weight 78 g more than females.

Model number 2

In the light of this information, a new model is created without the number of previous pregnancies, the mother age, and the smoking mother habits.

mod2 <- update(mod1, ~. -N.gravidanze-Anni.madre-Fumatrici, data = new_data)
summary(mod2)
## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
##     data = new_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.1  -184.4   -17.4   163.3  2626.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.6732   135.5952 -49.055  < 2e-16 ***
## Gestazione     31.3262     3.7884   8.269  < 2e-16 ***
## Lunghezza      10.2024     0.3009  33.909  < 2e-16 ***
## Cranio         10.6706     0.4247  25.126  < 2e-16 ***
## SessoM         79.1027    11.2205   7.050 2.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1652 on 4 and 2493 DF,  p-value: < 2.2e-16

The beta coefficients didn’t change too much, and the adjusted R.squared is unchanged as well.

Model number 3

Now we consider quadratic effects, in particular for variable where the scatterplot shows non linear trend as number of weeks of gestation.

mod3 <- update(mod2, ~. +I(Gestazione^2))
summary(mod3)
## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     I(Gestazione^2), data = new_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1132.76  -181.60   -15.82   162.79  2649.15 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4640.6017   899.9572  -5.156 2.71e-07 ***
## Gestazione        -80.9464    49.8136  -1.625   0.1043    
## Lunghezza          10.3056     0.3041  33.892  < 2e-16 ***
## Cranio             10.7671     0.4265  25.247  < 2e-16 ***
## SessoM             76.9130    11.2530   6.835 1.03e-11 ***
## I(Gestazione^2)     1.4988     0.6631   2.260   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.9 on 2492 degrees of freedom
## Multiple R-squared:  0.7267, Adjusted R-squared:  0.7261 
## F-statistic:  1325 on 5 and 2492 DF,  p-value: < 2.2e-16

We observe that the effect of gestation at 1st grade becomes insignificant (p-value is up to 10%), while the quadratic effect of the variable has p-value below 5%. Considering the adjusted R-squared which is unchanged, to keep the model as simple as possible mod2 is preferred to mod3.

Model number 4

The cranium diameter variable shows a non linear trend vs weight too. We create mod4 with Cranio variable squared, from mod2.

mod4 <- update(mod2, ~. +I(Cranio^2))
summary(mod4)
## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     I(Cranio^2), data = new_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1127.14  -183.87   -14.67   163.88  2609.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   74.00064 1153.55946   0.064    0.949    
## Gestazione    37.78342    3.92069   9.637  < 2e-16 ***
## Lunghezza     10.44188    0.30165  34.616  < 2e-16 ***
## Cranio       -31.40498    7.17967  -4.374 1.27e-05 ***
## SessoM        74.28142   11.17615   6.646 3.68e-11 ***
## I(Cranio^2)    0.06224    0.01060   5.871 4.92e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.3 on 2492 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7293 
## F-statistic:  1346 on 5 and 2492 DF,  p-value: < 2.2e-16

Both “Cranio” and “Cranio”^2 are significative, as there p-values are very low, but beta coefficient of “Cranio”^2 is close to 0, so the interaction with weight is not important.

Adjusted R.squared stays equal, so in order to keep the model as simple as possible, “Cranio”^2 is not kept in final model.

So far, mod2 is the best model.

Model number 5

Interactions between variables are studied. We can consider it between length and weeks of gestation.

mod5 <- update(mod2, ~. +Lunghezza*Gestazione)
summary(mod5)
## 
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     Gestazione:Lunghezza, data = new_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1121.91  -181.73   -13.56   166.26  2639.49 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.031e+03  9.220e+02  -2.202 0.027728 *  
## Gestazione           -9.311e+01  2.485e+01  -3.747 0.000183 ***
## Lunghezza             2.599e-02  2.031e+00   0.013 0.989789    
## Cranio                1.089e+01  4.249e-01  25.639  < 2e-16 ***
## SessoM                7.350e+01  1.122e+01   6.551 6.91e-11 ***
## Gestazione:Lunghezza  2.688e-01  5.305e-02   5.067 4.34e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 2492 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7284 
## F-statistic:  1340 on 5 and 2492 DF,  p-value: < 2.2e-16

Length as single effect is not significative anymore. The improvement of adjusted R-squared is minimal.

Optimal Model Selection

Akaike and Bayes information criteria are calculated for all the created models.

AIC(mod1, mod2, mod3, mod4, mod5)
##      df      AIC
## mod1  9 35155.07
## mod2  6 35159.12
## mod3  7 35156.01
## mod4  7 35126.81
## mod5  7 35135.52

According to AIC data, mod4 is the best.

BIC(mod1, mod2, mod3, mod4, mod5)
##      df      BIC
## mod1  9 35207.48
## mod2  6 35194.06
## mod3  7 35196.77
## mod4  7 35167.58
## mod5  7 35176.28

Based on BIC values, mod4 is the best.

We preferred mod2, while Akaike and Bayes information criteria indicate mod4.

We test stepAIC from MASS package in order to check this result:

# Creation of a model with all variables:
mod6 <- lm(Peso~., data = new_data)

# Procedura stepAIC
stepwise.mod <- MASS :: stepAIC(mod6, direction="both", k=log(n))
## Start:  AIC=28118.61
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1     36710 186779904 28111
## - Fumatrici     1     90677 186833870 28112
## - Ospedale      2    687555 187430749 28112
## - N.gravidanze  1    446244 187189438 28117
## - Tipo.parto    1    451073 187194266 28117
## <none>                      186743194 28119
## - Sesso         1   3610705 190353899 28159
## - Gestazione    1   5458852 192202046 28183
## - Cranio        1  45318506 232061700 28654
## - Lunghezza     1  87861708 274604902 29074
## 
## Step:  AIC=28111.28
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Fumatrici     1     91599 186871503 28105
## - Ospedale      2    693914 187473818 28105
## - Tipo.parto    1    452049 187231953 28110
## <none>                      186779904 28111
## - N.gravidanze  1    631082 187410986 28112
## + Anni.madre    1     36710 186743194 28119
## - Sesso         1   3617809 190397713 28151
## - Gestazione    1   5424800 192204704 28175
## - Cranio        1  45569477 232349381 28649
## - Lunghezza     1  87852027 274631931 29066
## 
## Step:  AIC=28104.68
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Ospedale      2    702925 187574428 28098
## - Tipo.parto    1    444404 187315907 28103
## <none>                      186871503 28105
## - N.gravidanze  1    608136 187479640 28105
## + Fumatrici     1     91599 186779904 28111
## + Anni.madre    1     37633 186833870 28112
## - Sesso         1   3601860 190473363 28145
## - Gestazione    1   5358199 192229702 28168
## - Cranio        1  45613331 232484834 28642
## - Lunghezza     1  88259386 275130889 29063
## 
## Step:  AIC=28098.41
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Tipo.parto    1    467626 188042054 28097
## <none>                      187574428 28098
## - N.gravidanze  1    648873 188223301 28099
## + Ospedale      2    702925 186871503 28105
## + Fumatrici     1    100610 187473818 28105
## + Anni.madre    1     44184 187530244 28106
## - Sesso         1   3644818 191219246 28139
## - Gestazione    1   5457887 193032315 28162
## - Cranio        1  45747094 233321522 28636
## - Lunghezza     1  87955701 275530129 29051
## 
## Step:  AIC=28096.81
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      188042054 28097
## - N.gravidanze  1    621053 188663107 28097
## + Tipo.parto    1    467626 187574428 28098
## + Ospedale      2    726146 187315907 28103
## + Fumatrici     1     92548 187949505 28103
## + Anni.madre    1     45366 187996688 28104
## - Sesso         1   3650790 191692844 28137
## - Gestazione    1   5477493 193519547 28161
## - Cranio        1  46098547 234140601 28637
## - Lunghezza     1  87532691 275574744 29044

The model as the best one following this procedure would be the equivalent of the mod2, but keeping the number of previous pregnancies as relevant data. As we have seen, this is mainly due to an only one observation with 7 previous pregnancies. So we decide to keep as the best one our mod2.

Model Quality Analysis

We consider mod2 as our final and best regression model.

RMSE calculation

RMSE_2 <- sqrt(mean(mod2$residuals^2))
print(RMSE_2)
## [1] 274.8193

RMSE measures the average difference between values predicted by the model and the actual values. So using mod2, the average gap between predicted value and reel one is nearly 275g on the variable weight.

Residuals analysis

par(mfrow = c(2,2))
plot(mod2)

Cook distance analysis

On the previous Residuals vs Leverage plot, the 1551 point presents a Cook distance of 1.

Let’s see in details the data corresponding to this observation:

new_data[1549, ]
##      Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio
## 1551         35            1         0         38 4370       315    374
##      Tipo.parto Ospedale Sesso
## 1551        Nat     osp3     F

It corresponds to a female newborn from a non-smoker mother, who already had a pregnancy, with a gestation of 38 weeks. Her anthropomorphic characteristics seem abnormal: considering the statistics of this database, the lenght is very short (315 mm against a median of 500 mm), the cranium diameter is pretty high (374 mm against a mean of 340 mm), and the weight is high too (4370 g against a mean of 3284 g).

We decide to remove this value, as it is placed on the 1 “alert” line.

new_data <- new_data[-c(1549),] 

Residuals mean value

mean(residuals(mod2))
## [1] 1.72925e-14

As requested for linear regression model, residuals mean is equal to 0.

Residuals Gaussian distribution

library(lmtest)
shapiro.test(residuals(mod2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod2)
## W = 0.97426, p-value < 2.2e-16

p-value is close to 0: residuals don’t follow a normal distribution.

Looking at the Q-Q plot, extreme values indeed are out of the bisector plot.

Homoscedasticity

bptest(mod2)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 89.198, df = 4, p-value < 2.2e-16

As p-value is very low, we reject H0, homoscedasticity hypothesis, even if on the plot, the points seem to be casually placed around an horizontal value.

Residuals non-correlation

dwtest(mod2)
## 
##  Durbin-Watson test
## 
## data:  mod2
## DW = 1.9553, p-value = 0.1318
## alternative hypothesis: true autocorrelation is greater than 0

As p-value is equals to 13%, we don’t reject H0: residuals are not correlated.

Multicollinearity between variables

library(car)
vif(mod2)
## Gestazione  Lunghezza     Cranio      Sesso 
##   1.654101   2.070582   1.606316   1.038918

As required to validate a model, all VIF (Variance Inflation Factors) have values below 5, what signify a non multicollinearity between variables.

Prediction

Weight of a female newborn, from a mother with 2 previous pregnancies, and a gestation of 39 weeks.

predict(mod2, 
        newdata = data.frame(
  Sesso="F", 
  Gestazione=39,
  Lunghezza=mean(Lunghezza), 
  Cranio=mean(Cranio)))
##        1 
## 3245.462

The predicted weight is equal to 3245 g.

Visualization

Weight as function of weeks of gestation and sex or smoking habits

ggplot(data=new_data)+
  geom_point(aes( x = Gestazione, y = Peso, col = Sesso), position = "jitter", size = 0.5)+
  geom_smooth(aes( x = Gestazione, y = Peso, col = Sesso), se = F, method = "lm")+
  labs(title = "Weight as function of weeks of gestation for male and female",
       x = "Gestation, in weeks",
       y= "Weight, in g")+
  theme_light()

Weight newborn increases with weeks of gestation. As male weight is higher as female’s, the curve for male is a little higher than female’s one, but the slope is the same for both.

ggplot(data=new_data)+
  geom_point(aes( x = Gestazione, y = Peso, col = factor(Fumatrici)), position = "jitter", size = 0.5)+
  geom_smooth(aes( x = Gestazione, y = Peso, col = factor(Fumatrici)), se = F, method = "lm")+
  labs(title = "Weight as function of weeks of gestation for smoking and non- mothers",
       x = "Gestation, in weeks",
       y= "Weight, in g")+
  theme_light()+
  scale_color_manual(name = "Legend: ", values=c("aquamarine3", "coral1"), labels = c("Non-smoking", "Smoking"))

There is a lower slope for the data about smoking mothers: for long gestations (beyond 36 weeks), newborn weight gets slightly lower than newborns from non-smoking mothers.

Conclusion

A regression model has been developed in order to explain which variables can affect newborns weight and to predict it. In this study the most useful part is the prediction: it is important to be able to anticipate this data before birth, to organize human and material resources to take care about most fragile newborns, the ones with lower weights.

The most relevant variables are weeks of gestation and sex, and as control variables length and cranium diameter increase with the weight (the baby becomes bigger, so heavier and his measurements grow as well).

As surprising as it may sound, some variables as smoking mother habits don’t influence the weight, as well as mother age.