Statistical Model for Predicting Newborn Weight

Company Context Company: Neonatal Health Solutions

Objective: Create a statistical model capable of accurately predicting newborn birth weight, based on clinical variables collected from three hospitals.

The project aims to improve the management of high-risk pregnancies, optimizing hospital resources and ensuring better outcomes for neonatal health.

The variables collected include:

Maternal Age: Measure of age in years.

kable(summary(dataset), caption="Summary of the variables in the dataset") %>%
  kable_styling()
Summary of the variables in the dataset
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. : 0.00 Min. : 0.0000 0:2396 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Length:2500 Length:2500 Length:2500
1st Qu.:25.00 1st Qu.: 0.0000 1: 104 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Class :character Class :character Class :character
Median :28.00 Median : 1.0000 NA Median :39.00 Median :3300 Median :500.0 Median :340 Mode :character Mode :character Mode :character
Mean :28.16 Mean : 0.9812 NA Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 NA 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 NA Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

We note that the variable Age of the mother has a minimum value equal to 0 meaning that there are some wrong data. So we first display the distribution of the age of the mothers.

ggplot(data = subset(dataset))+
  geom_bar(aes(x=Anni.madre),
           stat="count",
           col="khaki",
           fill="khaki")+
  labs(title="Distribution of the mother age",
       x="Age",
       y="absolute frequency")+
  scale_y_continuous(breaks = seq(0,1200,200))+
  scale_x_continuous(breaks = seq(0,46,2))+
  theme_classic()+
  theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))

We can assume that the minimum meaningfull age of a mother is 12 years. So we filter the dataset to remove the dirty data corresponding to mothers with an age smaller than 12.

dataset <- dataset %>% 
  filter(Anni.madre >=12)

Next, we start analyzing the anthropometrics data (Weigth, Length and Cranial Circumference) of the newborns in the dataset.

First we report a table with the summary of the statistics of the anthropometrics variables

Indices of weight, length and cranial diameter
quartile_1 median quartile_3 mean sd cv skewness kurtosis std_err
Cranial circumference (mm) 330 340 350 340.03 16.43 4.83 -0.79 2.94 0.33
Weight (gr) 2990 3300 3620 3284.18 525.23 15.99 -0.65 2.03 10.51
Length (mm) 480 500 510 494.70 26.33 5.32 -1.51 6.48 0.53

Second we display the distribution of the anthropometric variables again a normal distribution

# the function normalize takes in input a variable in string form and returns its values normalized by its mean and standard deviation
normalize <- function(x){
  return ((x-mean(x))/sd(x))
}

# we generate a standard normal distribution
gaussian_distribution <- rnorm(1000000,0,1)


# we plot the density of the different variables against a standard normal distribution
ggplot()+
  geom_density(aes(x=gaussian_distribution, fill="N01"), color=NA,  alpha=.8)+
  geom_density(aes(x=normalize(Peso), color="Weight"), linewidth=1.5)+
  geom_density(aes(x=normalize(Cranio), color="Cranial circumference"), linewidth=1.5)+
  geom_density(aes(x=normalize(Lunghezza), color="Length"), linewidth=1.5)+
  labs(title="Distribution of the variables",
       x="Normalized variable ",
       y="density")+
  scale_fill_manual(values = c(N01 = "grey"),   
                    labels = c(N01 = "N(0,1)")) +
  theme_classic()+
  theme(plot.title.position = "panel",
        plot.title=element_text(hjust=0.5))+
  guides(fill = guide_legend(override.aes = list(color = NA)))+
  labs(color = "Variabile",
      fill = "Distribuzione normale")

The plots as also the values of skeweness and Kurtosis show that:

  • All the anthropometric variables are more peaked than a normal distribution, so the probability of outliers is smaller that for the normal distribution with same mean and standard deviation.

  • Values in the left tails of the distributions are more likely than values in the right tail. This could be due to the data relative to premature births.

To check if the asymmetry is due to the premature births we plot the distribution of births by gestational weeks.

ggplot(data=dataset)+
  geom_bar(aes(x=Gestazione),
           stat="count",
           col="pink",
           fill="pink")+
  labs(title="Distribution of gestational age at childbirth",
       x="gestational age (weeks)",
       y="absolute frequency")+
  scale_y_continuous(breaks = seq(0,750,50))+
  scale_x_continuous(breaks = seq(25,43,1))+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))

A delivery is considered premature if it occurs before 37 weeks of gestation. By looking at the barplot we note there exist different many premature birthds in the dataset. We expect the anthropometric measurements of premature newborns to be significantly smaller.

So we filter the data relative only to the non-premature births and study their statistics.

regular_deliveries_data <- dataset %>% filter(Gestazione >= 37)

statistics_regular_deliveries <- data.frame(
        
  quartile_1 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.25, names=FALSE)),2),
                          
  median = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, median),2),
                            
  quartile_3 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.75, names=FALSE)),2),
                          
  mean = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, mean),2),
                          
  sd = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, sd),2),
                          
  cv = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, cv_funct),2),
                          
  skewness = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, skewness),2),
                          
  kurtosis = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, kurtosis_index),2),
  
  std_err = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")],2,std.err),2)
)

rownames(statistics_regular_deliveries) <- c("Cranial circumference (mm)", "Weight (gr)", "Length (mm)")

kable(statistics_regular_deliveries, caption="Indices of weight, length and cranial diameter of the non-premature births") %>%
  kable_styling()
Indices of weight, length and cranial diameter of the non-premature births
quartile_1 median quartile_3 mean sd cv skewness kurtosis std_err
Cranial circumference (mm) 332 340.5 350 341.71 14.21 4.16 -0.01 0.30 0.29
Weight (gr) 3050 3320.0 3640 3349.56 443.69 13.25 0.13 0.21 9.18
Length (mm) 485 500.0 510 498.11 21.16 4.25 -0.44 2.74 0.44



# we plot the density of the different variables against a standard normal distribution
ggplot()+
  geom_density(aes(x=gaussian_distribution, fill="N01"), color=NA,  alpha=.8)+
  geom_density(aes(x=normalize(regular_deliveries_data$Peso), color="Weight"), linewidth=1.5)+
  geom_density(aes(x=normalize(regular_deliveries_data$Cranio), color="Cranial circumference"), linewidth=1.5)+
  geom_density(aes(x=normalize(regular_deliveries_data$Lunghezza), color="Length"), linewidth=1.5)+
  labs(title="Distribution of the anthropometric variables \n relative to the non-premature deliveries",
       x="Normalized variable ",
       y="density")+
  scale_fill_manual(values = c(N01 = "grey"),   
                    labels = c(N01 = "N(0,1)")) +
  theme_classic()+
  theme(plot.title.position = "panel",
        plot.title=element_text(hjust=0.5))+
  guides(fill = guide_legend(override.aes = list(color = NA)))+
  labs(color = "Variabile",
      fill = "Distribuzione normale")

The significantly smaller values of skeweness of all the anthropometric variables corresponding only to the non-premature births confirm that the asymmetry that we had previously observed was mainly due to the presence of premature births. Also the values of the kurtosis have significanlty decreased suggesting a much better normal approximation of the distribution of the anthropometric variables when filtering the data corresponding to the only non-premature deliveries. The length is the only variables that is not really normally distributed. The non-normality of the variable Lenght could be due to the presenze of some corrupted data. To this end we study the correlation between the anthropometric variables.



cat("The correlation coefficiente of the variable Length with the varibles Weight and Cranial circumference in non premature newborns is equal to: \n", 
    "-cor(Length, Weight) = ", cor(regular_deliveries_data$Lunghezza, regular_deliveries_data$Peso)
 , "\n",
    "-cor(Length, Cranial circumference) = ", cor(regular_deliveries_data$Lunghezza, regular_deliveries_data$Cranio)
)
The correlation coefficiente of the variable Length with the varibles Weight and Cranial circumference in non premature newborns is equal to: 
 -cor(Length, Weight) =  0.7118152 
 -cor(Length, Cranial circumference) =  0.4566794
scatt1 = ggplot(regular_deliveries_data, aes(x = Lunghezza, y = Peso)) + 
                geom_point()+
                theme_classic()+
                theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                      axis.title.y = element_text(size = 14, margin = margin(r = 10)))+
                labs(title="Scatterplot of Lenght against Weight",
                                        x=" Lenght(mm)",
                                        y=" Weight(gr)")+
                scale_y_continuous(breaks = seq(2000,5000,500))+
                scale_x_continuous(breaks = seq(300,600,50))


scatt2 = ggplot(regular_deliveries_data, aes(x = Lunghezza, y = Cranio)) + 
                geom_point()+
                theme_classic()+
                theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                      axis.title.y = element_text(size = 14, margin = margin(r = 10)))+
                labs(title="Scatterplot of Lenght against Cranial circumference",
                                        x=" Lenght(mm)",
                                        y=" Cranial circumference(mm)")+
                scale_x_continuous(breaks = seq(300,600,50))

comb_fig_scatt <- scatt1 + scatt2 + plot_layout(ncol=2, heights = c(5), widths= c(5,5))
print(comb_fig_scatt)

From the scatterplots, it can be seen that there is one outlier measurement reporting a length smaller than 350 mm, which could be due to an error in the dataset, such as a typo in the annotation of the newborn’s length. So we decide to clean the dataset of regular newborns from this datapoint.

regular_deliveries_data <- regular_deliveries_data %>% filter(Lunghezza >= 350)

statistics_regular_deliveries <- data.frame(
        
  quartile_1 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.25, names=FALSE)),2),
                          
  median = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, median),2),
                            
  quartile_3 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.75, names=FALSE)),2),
                          
  mean = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, mean),2),
                          
  sd = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, sd),2),
                          
  cv = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, cv_funct),2),
                          
  skewness = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, skewness),2),
                          
  kurtosis = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, kurtosis_index),2),
  
  std_err = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")],2,std.err),2)
)

rownames(statistics_regular_deliveries) <- c("Cranial circumference (mm)", "Weight (gr)", "Length (mm)")

kable(statistics_regular_deliveries, caption="Indices of weight, length and cranial diameter of the non-premature births") %>%
  kable_styling()
Indices of weight, length and cranial diameter of the non-premature births
quartile_1 median quartile_3 mean sd cv skewness kurtosis std_err
Cranial circumference (mm) 332 340 350 341.69 14.20 4.15 -0.01 0.30 0.29
Weight (gr) 3050 3320 3640 3349.12 443.28 13.24 0.13 0.21 9.17
Length (mm) 485 500 510 498.19 20.82 4.18 -0.18 0.56 0.43

We observe that now, in the cleaned dataset, also the variable Length follows approximately a normal distribution, with both kurtosis and skewness coefficients close to 0.

Analysis of weight, length and cranial diameter of the sample with respect to the population statistics.

According to the study International standards for newborn weight, length, and head circumference by gestational age and sex, the birthweight, birthlength and birth head circumference of babies born in Italy from the 37th week of gestation on, have the following mean and standard deviation:

  • birthweight: Mean = 3.3, SD = 0.4,

  • birthlength: Mean = 49.4, SD = 1.7,

  • birth head circumference: Mean = 34.0, SD = 1.2

From the central limit theorem we know that given a random variable X with mean \(\mu_0\) and standard deviation \(\sigma\), the mean over a random sample of N people distributes as a random variable following a normal distribution with mean \(\mu_0\) and standard deviation \(\sigma/N\).

So for the three variables weight, length and cranial circumference we assume as null hypothesis (H0) that they have mean \(\mu_0=\) 3300 gr, 49.4 cm and 34 cm and standard deviation \(\sigma=\) 400gr, 1.7cm, 1.2cm, respectively. Then we test this hypotheses using a Z-test. To this end we consider the sub-dataset corresponding to the births after the 37th week of gestation, having length \(N\).

Then we compute the Z-test value \[Z = \frac{mean(Variable)-\mu_0}{\sigma/\sqrt{N}}.\]

We compute the corresponfing p-values, i.e. the probability of getting a value equal or smaller than mean(Variable) by randomly sampling N births. If the p-value is smaller than the level of significance that we set to 0.05, we reject the hypothesis H0.

z_test = function(x, mu0, stdev, alfa){

  x = na.omit(x)
  mu_cap = mean(x)
  n = length(x)
  
  Z = (mu_cap - mu0) / (stdev / sqrt(n))
  valori.soglia = qnorm(c(alfa/2, 1-alfa/2))
  
  CI <- c(mu_cap+qnorm(alfa/2)*(stdev/sqrt(n)),
          mu_cap-qnorm(alfa/2)*(stdev/sqrt(n)))
    
  return(
    list(mean.sample= mu_cap,
         stat.test = Z,
         threshold.val = valori.soglia,
         pvalue = 2*pnorm(-abs(Z)),
         Int.Conf. = CI
         ))
}
z_test_weight = z_test(regular_deliveries_data$Peso, 3300, 400, 0.05)

z_test_length = z_test(regular_deliveries_data$Lunghezza, 494, 17, 0.05)

z_test_cranial_diam = z_test(regular_deliveries_data$Cranio, 340, 12, 0.05)


cat("The p-values of length, weight and cranial circumference are all smaller than 0.05, so we reject the hypothesis H0 that the three variables are distributed according to the distribution of weight, length and cranial diameter of the italian population. \n", 
    "-p_value(Weight) = ", z_test_weight$pvalue , "\n",
    "-p_value(Length) = ", z_test_length$pvalue , "\n",
    "-p_value(Cranial diameter) = ", z_test_cranial_diam$pvalue )
The p-values of length, weight and cranial circumference are all smaller than 0.05, so we reject the hypothesis H0 that the three variables are distributed according to the distribution of weight, length and cranial diameter of the italian population. 
 -p_value(Weight) =  2.960815e-09 
 -p_value(Length) =  1.124683e-32 
 -p_value(Cranial diameter) =  9.627454e-12

In particular we display the Z-test values of the anthropometric variables against the level of significance of 0.05 of the Z-test.

ggplot()+
  geom_density(aes(x=gaussian_distribution, fill="N01"), color=NA,  alpha=.8) +
  geom_vline(aes(xintercept=z_test_weight$stat.test, color="Weight"), linewidth=1.5) +
  geom_vline(aes(xintercept=z_test_length$stat.test, color="Length"), linewidth=1.5)+
  geom_vline(aes(xintercept=z_test_cranial_diam$stat.test, color="Cranial circumference"), linewidth=1.5)+
  geom_vline(aes(xintercept=z_test_length$threshold.val, color="Threshold p=0.05"),linewidth=1, linetype="dashed")+
  scale_color_manual(values= c("Weight"= "gold1", "Length"="gold4", "Cranial circumference"="gold3", "Threshold p=0.05"="red"))+
  scale_fill_manual(values = c(N01 = "grey"),   
                    labels = c(N01 = "N(0,1)")) +
  labs(title="Z values of weight, length and cranial circumference",
       x=" ",
       y=" ")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)),
        legend.position ="bottom")

Analysis of the differences in the antrhopometric variables across the different between males and females.

First we display the distribution of newborns between the two sexes, which result quite uniform.

sex_colors <- c("M" = "dodgerblue", "F" = "hotpink")  


ggplot(data=dataset)+
  geom_bar(aes(x=Sesso,
               fill=Sesso,
               color=Sesso),
           stat="count")+
  labs(title="Sex distribution of newborns",
       x="sex",
       y="absolute frequency",
       color="SEX:",
       fill="SEX:")+
  scale_fill_manual(values= sex_colors)+
  scale_y_continuous(breaks = seq(0,1500,100))+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))

Second we study the statistics of the anthropometric variables across the two sexes and display them via a boxplot. In order to test later differences across the sexes using a t.test it is important to compare normally distrubuted quantities. For this reason, we restrict our study only to the not-premature deliveries.

cv_funct <- function(x) {
  return(sd(x) / mean(x) * 100)
}


conditional_dataset_sex <- regular_deliveries_data %>% 
  group_by(Sesso) %>% 
  summarise(mean_peso = round(mean(Peso),2),
            sd_peso = round(sd(Peso),2),
            cv_peso = round(cv_funct(Peso), 2),
            skew_peso = round(skewness(Peso), 2),
            kurt_peso = round(kurtosis_index(Peso), 2),
            
            mean_lunghezza = round(mean(Lunghezza),2),
            sd_lunghezza = round(sd(Lunghezza),2),
            cv_lunghezza = round(cv_funct(Lunghezza), 2),
            skew_lunghezza = round(skewness(Lunghezza), 2),
            kurt_lunghezza = round(kurtosis_index(Lunghezza), 2),
            
            mean_cranio = round(mean(Cranio),2),
            sd_cranio = round(sd(Cranio),2),
            cv_cranio = round(cv_funct(Cranio), 2),
            skew_cranio = round(skewness(Cranio), 2),
            kurt_cranio = round(kurtosis_index(Cranio), 2))


colnames(conditional_dataset_sex) <- c("sex", "mean", "sd", "cv", "Skew", "kurt" , "mean", "sd", "cv" , "Skew", "kurt",  "mean", "sd", "cv", "Skew", "kurt")

kable(conditional_dataset_sex, caption="Indices of weight, length and cranial diameter of non-premature newborns, divided by sex") %>%
  kable_styling(bootstrap_options = "condensed", full_width = TRUE) %>%
  add_header_above(c(" "=1, "Weight" = 5, "Length" = 5, "Cranial circumference" = 5)) %>%
  row_spec(1, background = "pink")%>%
  row_spec(2, background = "lightblue") %>%
  column_spec(c(1,6,11), border_right = TRUE)
Indices of weight, length and cranial diameter of non-premature newborns, divided by sex
Weight
Length
Cranial circumference
sex mean sd cv Skew kurt mean sd cv Skew kurt mean sd cv Skew kurt
F 3237.00 433.6 13.4 0.35 0.63 493.95 21.12 4.28 -0.17 0.74 339.56 14.08 4.15 0.06 0.33
M 3458.11 425.2 12.3 -0.04 0.26 502.31 19.67 3.92 -0.13 0.34 343.77 14.00 4.07 -0.07 0.38

box_length_sex <- ggplot(data=regular_deliveries_data,
                     aes(x=Sesso,
                        y=Lunghezza,
                        fill=Sesso),
                        color=Sesso)+
                     geom_boxplot()+
                     labs(title="Length ditribution according to sex\n of non premature newborns",
                          x=" ",
                          y="length (cm)")+
                     scale_fill_manual(values= sex_colors)+
                     theme_classic()+
                     theme(plot.title.position = "panel",
                           plot.title = element_text(size=16, hjust=0.5),
                           axis.title.x = element_text(size=14),
                           axis.title.y = element_text(size=14),
                           axis.text.x = element_text(size=14),
                           axis.text.y = element_text(size=10),
                           legend.position = "none")+
                      scale_y_continuous(breaks = seq(310, 560, by = 25))+
                      scale_x_discrete(labels=c("M"="Male", "F"="Female"))


box_weight_sex <- ggplot(data=regular_deliveries_data,
                     aes(x=Sesso,
                        y=Peso,
                        fill=Sesso),
                        color=Sesso)+
                     geom_boxplot()+
                     labs(title="Weight ditribution according to sex\n of non premature newborns",
                          x=" ",
                          y="weight (Kg)")+
                     scale_fill_manual(values= sex_colors)+
                     theme_classic()+
                     theme(plot.title.position = "panel",
                           plot.title = element_text(size=16, hjust=0.5),
                           axis.title.x = element_text(size=14),
                           axis.title.y = element_text(size=14),
                           axis.text.x = element_text(size=14),
                           axis.text.y = element_text(size=10),
                           legend.position = "none")+
                      scale_y_continuous(breaks = seq(1000, 5000, by = 400),
                                        labels = seq(1.0, 5.0, by = 0.4))+
                      scale_x_discrete(labels=c("M"="Male", "F"="Female"))


box_head_sex <- ggplot(data=regular_deliveries_data,
                     aes(x=Sesso,
                        y=Cranio,
                        fill=Sesso),
                        color=Sesso)+
                     geom_boxplot()+
                     labs(title="Cranial diameter ditribution according to sex\n of non premature newborns",
                          x=" ",
                          y="diameter (cm)")+
                     scale_fill_manual(values= c("M"="dodgerblue", "F"="hotpink"))+
                     theme_classic()+
                     theme(plot.title.position = "panel",
                           plot.title = element_text(size=16, hjust=0.5),
                           axis.title.x = element_text(size=14),
                           axis.title.y = element_text(size=14),
                           axis.text.x = element_text(size=14),
                           axis.text.y = element_text(size=10),
                           legend.position = "none")+
                      scale_y_continuous(breaks = seq(250, 400, by = 25))+
                      scale_x_discrete(labels=c("M"="Male", "F"="Female"))


comb_fig_antropometrics <- box_head_sex + box_weight_sex + box_length_sex + plot_layout(ncol=3, heights = c(5), widths= c(5,5,5))
print(comb_fig_antropometrics)

The data suggest a difference in the anthropometric variables across the two sexes, with higher values for all the variables in case of males rather than females.

We test the hypothesis that the antropometric values of males and females are significantly different

In order to perform a meaningfull t-test that compares the distributions of anthropometric measurements of males and females we first have to check that the distributions of the antrhopometric measurements of males and females are approximatelly normal and have similar variances. The previous table suggest that this should be true if we restrict our study only to the non-premature deliveries. We perform Q-Q tests and Shapiro Wilk tests to verify that the three variables distribute according to normal distributions for both sexes. We recall the Q-Q plots display the observed quantiles of the variables against the quantiles of a Gaussian distribution. So, if the variables come from a normal distribution, the points in the Q-Q plot will approximately lie along a straight line.


males <- regular_deliveries_data %>%
  filter(Sesso=="M")

females <- regular_deliveries_data %>%
  filter(Sesso=="F" & Lunghezza>400)


## QQ-plots of the length
QQ_length_males = ggplot(males, aes(sample = Lunghezza)) +
  labs(title="Q-Q Plot of the length\n of male newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "dodgerblue") +
  stat_qq_line(color="red") +
  theme_classic()

QQ_length_females = ggplot(females, aes(sample = Lunghezza)) +
  labs(title="Q-Q Plot of the length\n of female newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "hotpink") +
  stat_qq_line(color="black") +
  theme_classic()

## QQ_plots of the weight
QQ_weight_males = ggplot(males, aes(sample = Peso)) +
  labs(title="Q-Q Plot of the weight\n of male newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "dodgerblue") +
  stat_qq_line(color="red") +
  theme_classic()

QQ_weight_females = ggplot(females, aes(sample = Peso)) +
  labs(title="Q-Q Plot of the weight\n of female newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "hotpink") +
  stat_qq_line(color="black") +
  theme_classic()

QQ_cranial_males = ggplot(males, aes(sample = Cranio)) +
  labs(title="Q-Q Plot of the cranial circumference\n of male newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "dodgerblue") +
  stat_qq_line(color="red") +
  theme_classic()

QQ_cranial_females = ggplot(females, aes(sample = Cranio)) +
  labs(title="Q-Q Plot of the cranial circumference\n of female newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "hotpink") +
  stat_qq_line(color="black") +
  theme_classic()

comb_QQ <- QQ_length_males + QQ_length_females +
                  QQ_weight_males + QQ_weight_females +
                  QQ_cranial_males + QQ_cranial_females +
                  plot_layout(ncol=2, heights = c(5,5,5), widths= c(5,5))

print(comb_QQ)



shaptestLM = shapiro.test(males$Lunghezza)
shaptestLF = shapiro.test(females$Lunghezza)
shaptestWM = shapiro.test(males$Peso)
shaptestWF = shapiro.test(females$Peso)
shaptestCM = shapiro.test(males$Cranio)
shaptestCF = shapiro.test(females$Cranio)

cat("The maximum Shapiro-Wilk statistics of the 6 normality tests about distribution of weight, length and cranial circumference across the two sexes is:", min(shaptestLM$statistic, shaptestLF$statistic, shaptestWM$statistic, shaptestWF$statistic, shaptestCM$statistic, shaptestCF$statistic), "\n")
The maximum Shapiro-Wilk statistics of the 6 normality tests about distribution of weight, length and cranial circumference across the two sexes is: 0.9875359 
cat("The minimum Shapiro-Wilk p.value of the 6 normality tests about distribution of weight, length and cranial circumference across the two sexes is:", min(shaptestLM$p.value, shaptestLF$p.value, shaptestWM$p.value, shaptestWF$p.value, shaptestCM$p.value, shaptestCF$p.value))
The minimum Shapiro-Wilk p.value of the 6 normality tests about distribution of weight, length and cranial circumference across the two sexes is: 2.185776e-08

From the Q-Q plots and the Shapiro-Wilk tests we can argue that the anthropometric variables, even if not exactly distributed according to normal distributions, are really close to normal distributions. So we can perform a t.test to verify the null hypothesis that the distribution followed by males and females is the same.

ttest_length = t.test(males$Lunghezza, females$Lunghezza,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_weight = t.test(males$Peso, females$Peso,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_cranium = t.test(males$Cranio, females$Cranio,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

cat("The p-values of the t-tests about length, weight and cranial circumference distribution comparison between males and females are: \n", 
    "-p_value(Weight) = ", ttest_weight$p.value , "\n",
    "-p_value(Length) = ", ttest_length$p.value , "\n",
    "-p_value(Cranial diameter) = ", ttest_cranium$p.value )
The p-values of the t-tests about length, weight and cranial circumference distribution comparison between males and females are: 
 -p_value(Weight) =  3.149166e-34 
 -p_value(Length) =  1.88921e-22 
 -p_value(Cranial diameter) =  8.994501e-13

The very small p-values of the three t.tests show that there is a significant statistical difference in the mean of the anthropometric variables between males and females. In particular we can refuse the null hypothesis H0 and assert that the distributions of the anthropometric measurements followed by males and females are different.

Analysis of cesarean deliveries

Now we analyze the number of cesarean deliveries across the different hospitals. First we study the the absolute and relative frequency of each hospital.

#create a conditional dataset with the number of deliveries and cesarean deliveries per hospital
conditional_dataset_ospedale <- 
  dataset %>% 
  group_by(Ospedale) %>%  
  summarise(cesareo = sum(Tipo.parto == "Ces", na.rm = TRUE),
            naturale = sum(Tipo.parto == "Nat", na.rm = TRUE),
            parti= sum())
Freq_ass_hosp = table(dataset["Ospedale"])
Freq_rel_hosp = table(dataset["Ospedale"])/N_observations
Freq_hosp = cbind(Freq_ass_hosp,Freq_rel_hosp)
rownames(Freq_hosp) <- c("Hospital1","Hospital2","Hospital3")
colnames(Freq_hosp) <- c("Absolute frequency","Relative frequency")
kable(t(Freq_hosp), caption = 'Deliveries distribution across hospitals') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Deliveries distribution across hospitals
Hospital1 Hospital2 Hospital3
Absolute frequency 816.0000 848.0000 834.0000
Relative frequency 0.3264 0.3392 0.3336

We have observed that the distribution of the deliveries across the three hospitals is essentially uniform.

Next we use a boxplot to display the percentage of cesarean deliveries in each hospital.


hospitals=c("osp1" = "Hospital1", "osp2"="Hospital2", "osp3"="Hospital3")

dataset$Tipo.parto <- factor(dataset$Tipo.parto, levels = c("Nat", "Ces"))

bar_cesarean_hospital =  ggplot(data=dataset)+
                                geom_bar(aes(x=Ospedale,
                                             fill=Tipo.parto),
                                         position="fill")+
                                labs(title="Percentage of cesarean childbirths by hospitals",
                                     x="",
                                     y="Percentage",
                                     fill="type of delivery:")+
                                scale_fill_manual(values= c("Ces"="red", "Nat"="ivory2"), labels = c("Ces"= "Cesarean", "Nat"="Natural"))+
                                theme_classic()+
                                theme(plot.title.position = "panel",
                                      plot.title = element_text(size=14, hjust=0.5),
                                      axis.title.x = element_text(size=12),
                                      axis.title.y = element_text(size=12),
                                      axis.text.x = element_text(size=10),
                                      axis.text.y = element_text(size=10),
                                      legend.position = "bottom",
                                      legend.title = element_text(size = 12),
                                      legend.text = element_text(size = 12))+
                                 scale_y_continuous(labels = percent_format(),
                                           breaks = seq(0, 1, by = 0.1))+
                                 scale_x_discrete(labels=hospitals)


print(bar_cesarean_hospital)

The boxplot show that in all the three hospitals the percentage of cesarean deliveries is approximately of 30%. In particular, the number of cesarean deliveries seems to be marginally lower in the Hospital3 with respect to Hospital1 and Hospital2.

Next we test the null hypothesis that the distribution of cesarean deliveries is the same and so uniform across the three hospitals

We first use a Pearson’s Chi-squared test to examine the hypothesis that the proportion of cesarean deliveries is equal across the hospitals. It compares the observed frequencies of cesarean and natural deliveries in each hospital with the expected frequencies calculated under the null hypothesis of uniform distribution of cesarean deliveries across hospitals, literally: \[Expected\_Freq\_Ces(Hosp) = Prob(Cesarean)* Number\_Deliveries(Hosp) \] where \(Prob(Cesarean)=N\_tot(Cesarean)/N\_tot(deliveries)\). The test assesses whether the observed differences are statistically significant by providing a chi-squared statistic and a p-value to support or reject the null hypothesis.

# create contingency table bewtween hospitals and kind of delivery
table_delivery_hospital = table(Ospedale, Tipo.parto)

#The Pearson’s Chi-squared test is performed 
chisq_test = chisq.test(table_delivery_hospital)
cat("The p-value of the Pearson's Chi-squared test is" , chisq_test$p.value)
The p-value of the Pearson's Chi-squared test is 0.5777576

Based on the p-value of the Pearson’s Chi-squared we can not reject the null-hypothesis.

The test Chi-squared that we have performed is based on the assumption that for sufficiently many observations the distribution of the frequencies of cesarean deliveries at each hospital is close to a normal distribution, because of the central limit theorem. However, in this case, we know that the number of cesarean deliveries at the \(i\)-th hospital follows a binomial distribution \(B(p_i)\). So we can also perform a binomial test for each hospital to examine the hypothesis that \(p_i=Prob(Cesarean)\) for each hospital. \(Prob(Cesarean)\) as defined above is assumed as an accurate approximation of probability of cesarean delivery if the distribution is uniform across the hospitals, in particular since the distribution of deliveries per hospital is quite uniform there is no need to use a weighted formula for the probability \(Prob(Cesarean)\). We report the the \(p\)-value and the confidence interval of this test for each hospital.

# first we create a table with the total number of deliveries and cesarean deliveries observed at each hospital.

sub_dataset = subset(dataset, Ospedale=="osp1")
N_ces_hosp = sum(sub_dataset$Tipo.parto=="Ces")
N_deliveries_hosp = nrow(sub_dataset)

Freq_ces_hosp = table(Ospedale[Tipo.parto=="Ces"])
Ces_hosp = cbind(Freq_ass_hosp,Freq_ces_hosp)
Ces_hosp = as.data.frame(Ces_hosp)

rownames(Ces_hosp) <- c("Hospital_1", "Hospital_2", "Hospital_3")

# Next we compute Prob(cesarean) that should be a good approximation of the probability of oberving a cesarean delivery assuming this is uniform across the three hospitals.

prob_cesarean = sum(Tipo.parto=="Ces")/N_observations

# Last we perform a binomial test for each hospital
p_values <- numeric(nrow(Ces_hosp))
conf_int <- numeric(nrow(Ces_hosp))

for(i in seq_len(nrow(Ces_hosp))){
  test <- binom.test(Ces_hosp$Freq_ces_hosp[i], Ces_hosp$Freq_ass_hosp[i], p = prob_cesarean, alternative = "two.sided")
  p_values[i] <- test$p.value
  conf_int[i] <- paste0("(", round(test$conf.int[1],2), ",", round(test$conf.int[2],2),")")  
}

Ces_hosp$p_value <- p_values
Ces_hosp$conf_int <- conf_int

kable(t(Ces_hosp[, c("p_value", "conf_int")]), caption = 'P-values and confidence intervals for the binomial outcomes of each hospital,1n calculated under the assumption of a binomial distribution with a uniform probability\n of cesarean deliveries across all hospitals.') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), position="left")
P-values and confidence intervals for the binomial outcomes of each hospital,1n calculated under the assumption of a binomial distribution with a uniform probability of cesarean deliveries across all hospitals.
Hospital_1 Hospital_2 Hospital_3
p_value 0.7289190 0.5967907 0.4236389
conf_int (0.27,0.33) (0.27,0.33) (0.25,0.31)

Also in this case the p-value of the binomial tests do not allow to reject the null hypothesis asserting that the number of deliveries across the three hospitals is uniform.

Next we analyze the frequency of smoking mothers.

Freq_rel_smoker = sum(dataset$Fumatrici == 1)/N_observations

cat("The percentage of smoking mothers is:", paste0(Freq_rel_smoker,"%"))
The percentage of smoking mothers is: 0.0416%

Moreover we investigate if the smoke influence some of the other variables, including weight, length and Cranial circumference and gestational age.


boxplot_smoking_1= ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Peso,
                               fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Weight distribution",
                                x=" ",
                                y="weight (kg)")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(labels = seq(1,5,0.5),
                                                     breaks = seq(1000, 5000, by = 500))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))


boxplot_smoking_2 = ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Lunghezza,
                              fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Length distribution",
                                x=" ",
                                y="Length (cm)")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(breaks = seq(300,600,50),
                                                    labels = seq(30, 60, 5))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))


boxplot_smoking_3 = ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Cranio,
                              fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Cranial circumference distribution",
                                x=" ",
                                y="Cranial circumference (cm)")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(labels = seq(25,40,5),
                                                     breaks = seq(250, 400, by = 50))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))



boxplot_smoking_4 = ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Gestazione,
                              fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Gestational time distribution",
                                x=" ",
                                y="Gestational time")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(breaks = seq(26, 42, by = 2))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))

comb_smoke <- boxplot_smoking_1 + boxplot_smoking_2 + boxplot_smoking_3 + boxplot_smoking_4 +    plot_layout(ncol=2, heights = c(5,5), widths= c(5,5))

print(comb_smoke)

The boxplot may suggest a difference in the measurements and gestational time between smoking and non smoking mothers. We test this hypothesis using a t.test.

Smokers_data <- dataset %>%
  filter(Fumatrici=="1")

Not_smokers_data <- dataset %>%
  filter(Fumatrici=="0")

ttest_smoke_weight = t.test(Smokers_data$Peso, Not_smokers_data$Peso,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_smoke_length = t.test(Smokers_data$Lunghezza, Not_smokers_data$Lunghezza,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 


ttest_smoke_cranium = t.test(Smokers_data$Cranio, Not_smokers_data$Cranio,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_smoke_gestation = t.test(Smokers_data$Gestazione, Not_smokers_data$Gestazione,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

cat("The p-values of the t-tests about length, weight, cranial circumference and gestation time distribution comparison between smoking and non-smoking mothers are: \n", 
    "-p_value(Weight) = ", ttest_smoke_weight$p.value , "\n",
    "-p_value(Length) = ", ttest_smoke_length$p.value , "\n",
    "-p_value(Cranial diameter) = ", ttest_smoke_cranium$p.value, "\n",
    "-p_value(Cranial diameter) = ", ttest_smoke_gestation$p.value)
The p-values of the t-tests about length, weight, cranial circumference and gestation time distribution comparison between smoking and non-smoking mothers are: 
 -p_value(Weight) =  0.3428237 
 -p_value(Length) =  0.2984492 
 -p_value(Cranial diameter) =  0.6650278 
 -p_value(Cranial diameter) =  0.1064455

The t.tests do not allow us to reject the hypothesis that the gestational time and antropometric measurments of newborns are statistically different according to smoking and not-smoking mothers.

Next we analyze the number of deliveries and relation with the age of the mothers.

First we explore the variable number of deliveries.

Freq_abs_N.deliveries = table(dataset["N.gravidanze"]) 
Freq_rel_N.deliveries = table(dataset["N.gravidanze"])/N_observations

Freq_N.deliveries=data.frame(N.gravidanze = names(Freq_abs_N.deliveries),
                             Freq_abs = as.numeric(Freq_abs_N.deliveries),
                             Freq_rel = paste(round(Freq_rel_N.deliveries,2),"%"))

Freq_N.deliveries$N.gravidanze <- factor(Freq_N.deliveries$N.gravidanze, 
                                        levels=as.character(sort(as.numeric(unique(Freq_N.deliveries$N.gravidanze)))))


ggplot(data=Freq_N.deliveries)+
  geom_bar(aes(x=N.gravidanze, y=Freq_abs),
           stat="identity",
           col="aquamarine",
           fill="aquamarine")+
  labs(title="Number of previous pregnancies",
       x="",
       y="absolute frequency")+
  scale_y_continuous(breaks = seq(0,1200,200))+
  geom_text(aes(x=N.gravidanze, y=Freq_abs+50, label = Freq_rel))+
  theme_classic()+
  theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))

We observe that most of deliveries correspond to first deliveries and that less than 10% of the mothers have 4 children or more.

Next we investigate if there is a correlation between the number of children and the age of the mother. To this end we split the dataset in classes according to the Number of deliveries and years of the mother and we display the dstribution according to the cross classes.

dataset$N.gravidanze_CL <- cut(dataset$N.gravidanze, 
                     breaks=c(0,1,2,3,100), 
                     right=FALSE, 
                     labels=c("0", "1","2","3+"))


dataset$Anni.madre_CL <- cut(dataset$Anni.madre, 
                     breaks=c(11,20,26,32,38,46))

dataset$N.gravidanze_CL <- cut(dataset$N.gravidanze, 
                     breaks=c(0,1,2,3,100), 
                     right=FALSE, 
                     labels=c("0","1","2","3+"))

cont_table_year_N.delivery = table(dataset$Anni.madre_CL, dataset$N.gravidanze_CL)

ggpubr::ggballoonplot(data=as.data.frame(cont_table_year_N.delivery),
                      fill="khaki")

As expected we observe that an increasing number of children usually correspond to older mothers. We test the hypothesis the the distribution of number of childer is not uniform across the ages via a Pearson chi-squared test.

chisq_test = chisq.test(cont_table_year_N.delivery)
cat("The p-value of the Pearson's Chi-squared test is" , chisq_test$p.value) 
The p-value of the Pearson's Chi-squared test is 1.116617e-91

The Pearson’s Chi squared test confirms the hypothesis that the distribution of number of deliveries per age is not uniform.

Creation of a regression model for the weight of the newborn.

We start studying the correlations between the different variables and the variable weight. First we display the correlations between the different quantitative variable and the variable weight.

# panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
# {
#   #usr <- par("usr"); on.exit(par(usr))
#   #par(usr = c(0, 1, 0, 1))
#   r <- (cor(x, y))
#   txt <- format(c(r, 1), digits = digits)[1]
#   txt <- paste0(prefix, txt)
#   if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
#   text(0.5, 0.5, txt, cex = 1.5)
# }

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  r <- cor(x, y, use = "complete.obs")
  txt <- format(c(r, 1), digits = digits)[1]
  txt <- paste0(prefix, txt)
  
  usr <- par("usr") # leggi coordinate correnti
  x_pos <- mean(usr[1:2])  # centro asse x
  y_pos <- mean(usr[3:4])  # centro asse y
  
  if(missing(cex.cor)) cex.cor <- 10/strwidth(txt)
  text(x_pos, y_pos, txt, cex = 1.5, ...)
}


#correlazioni
pairs(dataset[, c("Anni.madre","N.gravidanze", "Gestazione", "Peso", "Lunghezza", "Cranio")], lower.panel=panel.cor, upper.panel=panel.smooth)

The correlation matrix suggest a medium to high correlation between Weight and Length/Cranium circumference and gestation time. Differently, the weight seems not to be correlated to the number of deliveries and age of the mother. Additionally we have already observed that there is not a significant correlation between the weight of the newborn and the fact that the mother is a smoker. Differently we have observed a strong correlation between the sex of the newborn and its weight.

The scatter plots also suggest a possible nonlinear dependance of the weight of the newborns with respect to gestation time and possibly Length and cranium circumference.

Even if this is not observable from the correlation matrix, the literature suggest that there could be a difference in the weight of the newborns between the first child and the subsequent ones (see International standards for newborn weight, length, and head circumference by gestational age and sex), for this reason we split the dataset according to the Number of delivery classes “0” and “1+” and use this as a qualitative variable in the investigation of a good regression model fo the weight of the newborns.

dataset$N.gravidanze_CL <- cut(dataset$N.gravidanze, 
                     breaks=c(0,1,100), 
                     right=FALSE, 
                     labels=c("0","1+"))

We investiagte the best linear regression model using a stepwise selection procedure. We start from the full model from which we omit the variables Hospital and Kind of delivery since there is no reason why these variables should influence the weight of the newborn. Then we subtract or add back the variables that seems less influential. Finally we select the best model based on the BIC metrics.

mod_tot<-lm(Peso ~ Anni.madre + N.gravidanze_CL + Fumatrici + Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod_tot)

Call:
lm(formula = Peso ~ Anni.madre + N.gravidanze_CL + Fumatrici + 
    Gestazione + Lunghezza + Cranio + Sesso, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1163.53  -183.34   -14.47   164.63  2615.88 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6721.5968   141.3193 -47.563  < 2e-16 ***
Anni.madre            1.2298     1.1238   1.094   0.2739    
N.gravidanze_CL1+    23.2268    11.8572   1.959   0.0502 .  
Fumatrici1          -30.5500    27.6434  -1.105   0.2692    
Gestazione           33.1338     3.8410   8.626  < 2e-16 ***
Lunghezza            10.2148     0.3011  33.930  < 2e-16 ***
Cranio               10.5152     0.4282  24.556  < 2e-16 ***
SessoM               78.5750    11.2142   7.007 3.13e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 274.8 on 2490 degrees of freedom
Multiple R-squared:  0.727, Adjusted R-squared:  0.7262 
F-statistic: 947.2 on 7 and 2490 DF,  p-value: < 2.2e-16

The variables “age”, “N.deliveries” and “smoking” seeem to be the less significant. So we start removing the variable “age” from the model.

mod1<-lm(Peso ~  N.gravidanze_CL + Fumatrici + Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod1)

Call:
lm(formula = Peso ~ N.gravidanze_CL + Fumatrici + Gestazione + 
    Lunghezza + Cranio + Sesso, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.91  -183.57   -14.79   163.87  2620.69 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6679.3566   135.9513 -49.131  < 2e-16 ***
N.gravidanze_CL1+    27.2419    11.2757   2.416   0.0158 *  
Fumatrici1          -30.8461    27.6432  -1.116   0.2646    
Gestazione           32.7241     3.8229   8.560  < 2e-16 ***
Lunghezza            10.2109     0.3010  33.918  < 2e-16 ***
Cranio               10.5387     0.4277  24.640  < 2e-16 ***
SessoM               78.7767    11.2132   7.025 2.74e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 274.8 on 2491 degrees of freedom
Multiple R-squared:  0.7269,    Adjusted R-squared:  0.7262 
F-statistic:  1105 on 6 and 2491 DF,  p-value: < 2.2e-16

cat(“After removing the variable "age" the variables "N.deliveries" and "smoking" are still not significant.” ) cat(“So we remove the variable "smoking" from the model.” )

mod2<-lm(Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod2)

Call:
lm(formula = Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + 
    Cranio + Sesso, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1147.94  -184.03   -14.54   164.49  2624.44 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6678.5707   135.9561 -49.123  < 2e-16 ***
N.gravidanze_CL1+    26.4198    11.2522   2.348    0.019 *  
Gestazione           32.4468     3.8150   8.505  < 2e-16 ***
Lunghezza            10.2257     0.3008  33.999  < 2e-16 ***
Cranio               10.5444     0.4277  24.654  < 2e-16 ***
SessoM               78.5900    11.2125   7.009 3.07e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 274.8 on 2492 degrees of freedom
Multiple R-squared:  0.7267,    Adjusted R-squared:  0.7262 
F-statistic:  1325 on 5 and 2492 DF,  p-value: < 2.2e-16

After removing the variable smoking the variables N.deliveries is still little significant. So we remove also this variable from the model.

mod3<-lm(Peso ~ Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod3)

Call:
lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
    data = dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-1138.1  -184.4   -17.4   163.3  2626.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6651.6732   135.5952 -49.055  < 2e-16 ***
Gestazione     31.3262     3.7884   8.269  < 2e-16 ***
Lunghezza      10.2024     0.3009  33.909  < 2e-16 ***
Cranio         10.6706     0.4247  25.126  < 2e-16 ***
SessoM         79.1027    11.2205   7.050 2.31e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 275.1 on 2493 degrees of freedom
Multiple R-squared:  0.7261,    Adjusted R-squared:  0.7257 
F-statistic:  1652 on 4 and 2493 DF,  p-value: < 2.2e-16

In the final model all the remaining variables seem very significant so we don’t remove any further variable. We also note that the adjusted R-squared has remained unchanged after removing the variables “age” and “smoking” while has only slightly reduced when removing the variable “N.deliveries”. Next we test the different models that we have considered by looking at the analysis of variance between each model and the subsequent one and then we look at the BIC value of the four models.

kable(anova(mod_tot, mod1, mod2, mod3), caption="Analysis of Variance of each model with respect to the previous one") %>%
  kable_styling()
Analysis of Variance of each model with respect to the previous one
Res.Df RSS Df Sum of Sq F Pr(>F)
2490 188062157 NA NA NA NA
2491 188152606 -1 -90449.42 1.197578 0.2739122
2492 188246657 -1 -94050.27 1.245254 0.2645685
2493 188663107 -1 -416450.41 5.513930 0.0189435

The analysis of variance between each model and the subsequent one shows that every time we have removed a variable the residual sum of squares has reduced, however the reduction has been statistically significant only once that we have removed all the three variables.

kable(BIC(mod_tot, mod1, mod2, mod3), caption="Bayesian Information Criterion (BIC) of the four models considered") %>%
  kable_styling()
Bayesian Information Criterion (BIC) of the four models considered
df BIC
mod_tot 9 35209.56
mod1 8 35202.94
mod2 7 35196.36
mod3 6 35194.06

Including only linear terms in the variables we find that the best model, according to the BIC values, is the one that predicts the weight of the newborn based only on the variables Length, Cranial circumference, sex and gestation time, followed closely by the model that included also the dependence upon the number of deliveries.

Next we start from the best model that we found and consider if to include also nonlinear terms in the variables. In particular we note that the the weigth should be proportional to the volume of the body and that the volume likely depends on cubic terms in the variables length and cranial circumferece. Indeed if we think about the body as a cylinder or parallelepipedon we could write its volume in terms of \(length\times(cranial\_circumference)^2\) or \(length^2\times cranial\_circumference\).

Additionally the scatterplots suggest a possible nonlinear dependence of the weight on the gestational age. As a first step we study the correlation of some nonlinear terms between each other and with the variable weight.

dataset$Gestazione2 <- (dataset$Gestazione)^2
dataset$logGestazione <- log(dataset$Gestazione)

pairs(dataset[, c("Peso", "Gestazione", "Gestazione2",
                "logGestazione")], lower.panel=panel.cor, upper.panel=panel.smooth)


dataset$Lunghezza2 <- (dataset$Lunghezza)^2
dataset$Lunghezza2xCranio <- (dataset$Lunghezza)^2*(dataset$Cranio)
dataset$Cranio2 <- (dataset$Cranio)^2
dataset$LunghezzaxCranio2 <- dataset$Lunghezza*(dataset$Cranio)^2

pairs(dataset[, c("Peso","Lunghezza", "Lunghezza2", "Lunghezza2xCranio", 
                "Cranio", "Cranio2", "LunghezzaxCranio2")], lower.panel=panel.cor, upper.panel=panel.smooth)

Due to the realtively small range intervals of the variables we note that the correlation between all the variables Gestation time, Length and Cranial circumference with their quadratic version, as also between gestation time and its logarithm, is 1 up to a second error approximation. So it doesn’t make sense to include these terms in the model since we can not expect them to perform significantly better than the linear terms.

The situation is slightly different for the cubic terms mixed in length and cranial circumference. Indeed, even if they have high correlations with the variables length and cranial circumference they also have higher correlations with the variable weight with respect to the linear terms. So we try to include these terms in the model.

tot_non_lin_mod <- lm(Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + Cranio + Lunghezza2xCranio + LunghezzaxCranio2 + Sesso, data=dataset)

nonlin_mod = MASS::stepAIC(tot_non_lin_mod, direction="both", k=log(N_observations))
Start:  AIC=28038.58
Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + Cranio + Lunghezza2xCranio + 
    LunghezzaxCranio2 + Sesso

                    Df Sum of Sq       RSS   AIC
<none>                           182562021 28039
- N.gravidanze_CL    1    604520 183166541 28039
- Lunghezza          1    978930 183540951 28044
- LunghezzaxCranio2  1   1609628 184171649 28053
- Cranio             1   1946141 184508162 28057
- Sesso              1   3096848 185658869 28073
- Lunghezza2xCranio  1   3611489 186173510 28080
- Gestazione         1   7719667 190281688 28134
summary(nonlin_mod)

Call:
lm(formula = Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + 
    Cranio + Lunghezza2xCranio + LunghezzaxCranio2 + Sesso, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1177.37  -182.96   -11.36   163.90  1860.40 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3.618e+03  7.682e+02  -4.709 2.62e-06 ***
N.gravidanze_CL1+  3.190e+01  1.111e+01   2.871 0.004121 ** 
Gestazione         4.063e+01  3.960e+00  10.261  < 2e-16 ***
Lunghezza         -7.825e+00  2.142e+00  -3.654 0.000264 ***
Cranio             2.131e+01  4.136e+00   5.152 2.78e-07 ***
Lunghezza2xCranio  8.887e-05  1.266e-05   7.018 2.88e-12 ***
LunghezzaxCranio2 -9.650e-05  2.060e-05  -4.686 2.94e-06 ***
SessoM             7.207e+01  1.109e+01   6.499 9.73e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 270.8 on 2490 degrees of freedom
Multiple R-squared:  0.735, Adjusted R-squared:  0.7342 
F-statistic: 986.5 on 7 and 2490 DF,  p-value: < 2.2e-16
kable(BIC(mod3, nonlin_mod), caption="Bayesian Information Criterion (BIC) of the linear model vs the nonlinear one") %>%
  kable_styling()
Bayesian Information Criterion (BIC) of the linear model vs the nonlinear one
df BIC
mod3 6 35194.06
nonlin_mod 9 35135.41
NA

The model selected by the stepwise method by taking into account for all the possible nonlinear terms that we have previously considered as a Bayesian Information Criterion (BIC) value significantly smaller than the best model that is linear in all the variables.

Next we compute the variance inflation factors of the selected model.


kable(car::vif(nonlin_mod), caption="Variance Inflation factors of the nonlinear model") %>%
  kable_styling()
Variance Inflation factors of the nonlinear model
x
N.gravidanze_CL 1.035160
Gestazione 1.865538
Lunghezza 108.279420
Cranio 157.281246
Lunghezza2xCranio 666.841683
LunghezzaxCranio2 790.238306
Sesso 1.047540
NA

As expected the model has extremely high variance inflaction factor due to the high correlation coefficients between the variables. Since the high variance inflation factors make the coefficient estimates unstable and unreliable, we prefer to discard the last model.

However we have learnt that the inclusion of nonlinear terms can improve the model. So, next, we try to update the last model by removing variables that are highly correlated to see if we can get a model that is more accurate than the linear model mod3 without having high variance inflation factors. In particular we highlight that a model expressing the weight linearly with respect to a volume would be physically much more meaningfull than one espressing the weight linearly with respect to lengths. For these reasons we opt for removing from the model the two linear variables “length” and “cranial_circumference”

nonlin_mod2 <- update(nonlin_mod, . ~ . - Lunghezza - Cranio)

summary(nonlin_mod2)

Call:
lm(formula = Peso ~ N.gravidanze_CL + Gestazione + Lunghezza2xCranio + 
    LunghezzaxCranio2 + Sesso, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1157.23  -183.06   -14.76   163.04  2610.90 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.408e+03  1.214e+02 -11.598  < 2e-16 ***
N.gravidanze_CL1+  3.268e+01  1.114e+01   2.934  0.00338 ** 
Gestazione         4.202e+01  3.660e+00  11.480  < 2e-16 ***
Lunghezza2xCranio  2.818e-05  1.560e-06  18.069  < 2e-16 ***
LunghezzaxCranio2  1.118e-05  2.228e-06   5.017 5.62e-07 ***
SessoM             7.099e+01  1.112e+01   6.383 2.06e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 272.2 on 2492 degrees of freedom
Multiple R-squared:  0.732, Adjusted R-squared:  0.7315 
F-statistic:  1361 on 5 and 2492 DF,  p-value: < 2.2e-16
kable(car::vif(nonlin_mod2), caption="Variance Inflation factors of the nonlinear model") %>%
  kable_styling()
Variance Inflation factors of the nonlinear model
x
N.gravidanze_CL 1.029988
Gestazione 1.577089
Lunghezza2xCranio 10.012010
LunghezzaxCranio2 9.153736
Sesso 1.042580

After removing the variables length and cranial circumference the variance inflation factor is still high for the two cubic variables. For this reason we remove from the model also the variable \(Lunghezza\times Cranio^2\) which results the less influential.

nonlin_mod3 <- update(nonlin_mod2, . ~ . - LunghezzaxCranio2)

summary(nonlin_mod3)

Call:
lm(formula = Peso ~ N.gravidanze_CL + Gestazione + Lunghezza2xCranio + 
    Sesso, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1178.36  -179.61   -14.04   161.78  2790.87 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.325e+03  1.209e+02 -10.964  < 2e-16 ***
N.gravidanze_CL1+  3.886e+01  1.112e+01   3.494 0.000485 ***
Gestazione         4.088e+01  3.671e+00  11.139  < 2e-16 ***
Lunghezza2xCranio  3.536e-05  6.238e-07  56.679  < 2e-16 ***
SessoM             7.028e+01  1.117e+01   6.290 3.73e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 273.5 on 2493 degrees of freedom
Multiple R-squared:  0.7293,    Adjusted R-squared:  0.7289 
F-statistic:  1679 on 4 and 2493 DF,  p-value: < 2.2e-16
kable(car::vif(nonlin_mod3), caption="Variance Inflation factors of the nonlinear model") %>%
  kable_styling()
Variance Inflation factors of the nonlinear model
x
N.gravidanze_CL 1.017382
Gestazione 1.571103
Lunghezza2xCranio 1.586448
Sesso 1.042414

kable(BIC(mod2, nonlin_mod, nonlin_mod2, nonlin_mod3), caption="Bayesian information factor of the linear mode vs nonlinear ones") %>%
  kable_styling()
Bayesian information factor of the linear mode vs nonlinear ones
df BIC
mod2 7 35196.36
nonlin_mod 9 35135.41
nonlin_mod2 7 35147.43
nonlin_mod3 6 35164.71
NA
NA

The final model predicts the weight of the new born based on gestation time, squared length times cranium circumference, sex and number of deliveries. The model has a BIC value larger than the best model which included all the possible nonlinear terms but smaller than the best model that included only linear terms in the variables. Moreover, differently from the best model that included all the possible nonlinear terms, the model “nonlin_mod3” has all the variance inflation factors smaller than 5.

We point out that the model “nonlin_mod3”, including the mixed term \(length^2\times cranial_circumferece\) but not the linear terms in the variables \(length\) ad \(cranial_circumferece\), may break the principle of marginality. However we highlight that the model is also physically more meaningfull that one including the linear terms in the variables \(length\) and \(cranial_circumference\). For these reasons we accept this model and analyze it. We start analyzing its coefficients

kable(coef(nonlin_mod3), caption="Coefficients of final nonlinear model") %>%
  kable_styling()
Coefficients of final nonlinear model
x
(Intercept) -1325.4187855
N.gravidanze_CL1+ 38.8606976
Gestazione 40.8848655
Lunghezza2xCranio 0.0000354
SessoM 70.2824691

The coefficients say that for each unit increment in the gestation time (1 week) we observe an increment of approximately 40gr in the weigth of the newborn. Similarly for each 1mm^3 increment in the volume term (\(Length^2 \times Cranial_circfumference\)) we register an increment of \(3.5*10^-5\) (gr) in the weight of the newborn. Finally we observe that the males will register approximately 7gr more than females and child after the first one weight approximately 4gr more than a first child.

Next we analyze the residuals, outliers and leverages of the model.

par(mfrow=c(2,2))
plot(nonlin_mod3)


lmtest::bptest(nonlin_mod3)

    studentized Breusch-Pagan test

data:  nonlin_mod3
BP = 13.12, df = 4, p-value = 0.01071
lmtest::dwtest(nonlin_mod3)

    Durbin-Watson test

data:  nonlin_mod3
DW = 1.9523, p-value = 0.1163
alternative hypothesis: true autocorrelation is greater than 0
shapiro.test(nonlin_mod3$residuals)

    Shapiro-Wilk normality test

data:  nonlin_mod3$residuals
W = 0.97047, p-value < 2.2e-16

The analysis shows that there is no autocorrelation of the residuals (p-value of the Durbin-Watson test > 0.05). However we observe that the residuals are not normally distributed (p-value of the Shapiro-Wilk test <0.05) and that there is no heteroscedasticity (p-value of the Breusch-Pagan < 0.05). Additionally we observe that there are some highly influential outlier datapoints having Cook distance close to 0.5.

So we investigate deeper the leverages and outliers. We consider leverages the points having leverage value larger than the threshold. This is set equal to 3 times the leverage mean, i.e. the number of parameters of the model over the number of observations in the dataset. Moreover the outliers are identified by a Bonferroni test on the studentized residuals of the model.


#leverage
lev<-hatvalues(nonlin_mod3)
plot(lev)
p<-sum(lev)
n<-length(lev)
soglia=3*p/n
abline(h=soglia,col=2)
title(main = "Leverage values of model mod1")

leverages=which(lev>soglia)
cat("There are", length(leverages), "leverages\n")
There are 46 leverages
#outliers
plot(rstudent(nonlin_mod3))
abline(h=c(-2,2))
title(main = "Studentized Residuals of model mod1")



cat("The outliers are:\n")
The outliers are:
car::outlierTest(nonlin_mod3)


#distanza di cook
cook<-cooks.distance(nonlin_mod3)
plot(cook,ylim = c(0,1)) 
title(main = "Cook distance of residuals of model mod1")

We note that the most influential outlier, the datapoint 1549, with a Cook distance around 0.2, could have a relevant effect on the estimate of the coefficients of the model but not drastical. By investigating this datapoint we find that this is the same datapoint that we have already met before which has a length much lower than expected, probably for a typo in its transposition.

kable(t(dataset[1549,]), caption="Outlier datapoint") %>%
  kable_styling()
Outlier datapoint
1549
Anni.madre 35
N.gravidanze 1
Fumatrici 0
Gestazione 38
Peso 4370
Lunghezza 315
Cranio 374
Tipo.parto Nat
Ospedale osp3
Sesso F
N.gravidanze_CL 1+
Anni.madre_CL (32,38]
Gestazione2 1444
logGestazione 3.637586
Lunghezza2 99225
Lunghezza2xCranio 37110150
Cranio2 139876
LunghezzaxCranio2 44060940
NA

Next we evaluate the quality of the model in terms of the \(R^2\) and RMSE metrics.

# R^2 :
summary_mod <- summary(nonlin_mod3)
R2 <- summary_mod$r.squared
cat("The R^2 value of the model is equal to", R2, ", so the model is explaining about 73% of the variance of the variable Weight. \n")
The R^2 value of the model is equal to 0.7293134 , so the model is explaining about 73% of the variance of the variable Weight. 
# RMSE:
predictions <- predict(nonlin_mod3, newdata = dataset)
residuals <- dataset$Peso - predictions
RMSE <- sqrt(mean(residuals^2))
perRMSE=RMSE/mean(dataset$Peso)
cat("The RMSE value of the model is equal to", RMSE, ".\n") 
The RMSE value of the model is equal to 273.2093 .
cat("This means that the model has a mean error of prediction of" , paste0(round(perRMSE,2),"%"),"with respect to the mean of the variable weight.\n")
This means that the model has a mean error of prediction of 0.08% with respect to the mean of the variable weight.
cat("Moreover this means that the variance of the prediction errors of the model is about", round(RMSE/sd(dataset$Peso),2), "of the variance of the variable weight")
Moreover this means that the variance of the prediction errors of the model is about 0.52 of the variance of the variable weight

Using the model to make predictions

Assume to have a female baby that will be delivered at the 39th week and whose mother has already delivered two children. Then we can predict her weight. For the new baby we assume a length and cranial circumference equal to the mean weight and cranial circumferene of female newborns at the 39th gestation week.

length= mean(subset(dataset, Sesso == "F" & Gestazione==39)$Lunghezza)
cranium= mean(subset(dataset, Sesso == "F" & Gestazione==39)$Cranio)

new_data <- data.frame("N.gravidanze" = c(2),
                       "Gestazione"=c(39),
                       "Lunghezza"=c(length),
                       "Cranio"=c(cranium),
                       "Sesso"=c("F")
                       )

new_data$Lunghezza2xCranio = (new_data$Lunghezza)^2*(new_data$Cranio)
new_data$Cranio2 <- (new_data$Cranio)^2
new_data$N.gravidanze_CL <- cut(new_data$N.gravidanze, 
                     breaks=c(0,1,100), 
                     right=FALSE, 
                     labels=c("0","1+"))


weight_prediction <- round(predict(nonlin_mod3, newdata = new_data))


cat("The regression model predicts a weight equal to", paste0(weight_prediction,"gr"), ".\n") 
The regression model predicts a weight equal to 3218gr .

Finally we display the relations captured in the model.

First we display how the weight is distributed according to the sex and number of deliveries with respect to the quantitative variables in the model, i.e. \(gestation\_time\) and \((length)^2\times cranial\_circumference\).


plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=Sesso),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=Sesso), se=F, method = "lm")+
                   scale_color_manual(values= sex_colors, 
                                   labels = c("M"= "Male", "F"="Female"))+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.title = element_text(size = 14),
                          legend.text = element_text(size = 14))


legend <- as_ggplot(get_legend(plot_fin_1))

plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=Sesso),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=Sesso), se=F, method = "lm")+
                   scale_color_manual(values= sex_colors, 
                                   labels = c("M"= "Male", "F"="Female"))+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")


 plot_fin_3 = ggplot(data=dataset)+
                    geom_point(aes(x=Lunghezza2xCranio,
                                 y=Peso,
                                 col=Sesso))+
                    geom_smooth(aes(x=Lunghezza2xCranio,
                                    y=Peso,
                                    col=Sesso), se=F, method = "lm")+
                   scale_color_manual(values= sex_colors, 
                                   labels = c("M"= "Male", "F"="Female"))+
                    labs(title = "Weight as a function of \n Lengt^2 x cranial circumference",
                        x = "Lengt^2 x cranial circumference (mm^3)",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")

  
  
comb_fin <- plot_fin_1 + plot_fin_3 + legend + plot_layout(ncol=3, heights = c(5), widths= c(5,5,1.5))

print(comb_fin)


plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=N.gravidanze_CL),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=N.gravidanze_CL), se=F, method = "lm")+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)",
                        color = "N° previous deliveries")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.title = element_text(size = 14),
                          legend.text = element_text(size = 14))


legend <- as_ggplot(get_legend(plot_fin_1))

plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=N.gravidanze_CL),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=N.gravidanze_CL), se=F, method = "lm")+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")

  
 plot_fin_3 = ggplot(data=dataset)+
                    geom_point(aes(x=Lunghezza2xCranio,
                                 y=Peso,
                                 col=N.gravidanze_CL))+
                    geom_smooth(aes(x=Lunghezza2xCranio,
                                    y=Peso,
                                    col=N.gravidanze_CL), se=F, method = "lm")+
                    labs(title = "Weight as a function of \n Lengt^2 x cranial circumference",
                        x = "Lengt^2 x cranial circumference (mm^3)",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")

  
  
comb_fin <- plot_fin_1 + plot_fin_3 + legend + plot_layout(ncol=3, heights = c(5), widths= c(5,5,1.5))

print(comb_fin)

The plots confirm that the weight increases while increasing all the other variables in the model. Moreover we observe that the weight observed in males is always higher than the weight observed in females. Differently the difference in weight between first children and not first children is only observable when varying the variable gestation time. In this case we observe that the weight of first children is slightly smaller than the weight of not first children. The observations are in agreements with the discussion about the \(\beta\) coefficients of the model.

---
title: "Statistical_Model_Newborn_weight"
output: html_notebook
---

Statistical Model for Predicting Newborn Weight

Company Context Company: Neonatal Health Solutions 

Objective: Create a statistical model capable of accurately predicting newborn
birth weight, based on clinical variables collected from three
hospitals.   

The project aims to improve the management of high-risk pregnancies, optimizing hospital resources and ensuring better outcomes for neonatal health.   

The variables collected include:   

Maternal Age: Measure of age in years.   

- Number of Pregnancies: How many pregnancies the mother has previously had.   

- Maternal smoking: A binary indicator (0 = non-smoker, 1 = smoker).   

- Pregnancy duration: Number of weeks of gestation.   

- Infant weight: Birth weight in grams.   

- Length and Head circumference: Infant length and
head diameter, which can also be measured during pregnancy via
ultrasound.   

- Type of delivery: Natural or cesarean.   

- Hospital of birth: Hospital 1, 2, or 3.   

- Infant sex: Male (M) or female (F).   


```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
library(dplyr)
library(moments)
library(ggplot2)
library(patchwork)
library(scales)
library(ggpubr)
options(width = 100)
library(knitr)
library(kableExtra)
library(car)
library(ggpubr)
```


```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
# IMPORT THE DATASET
dataset <- read.csv("neonati.csv")

N_observations = dim(dataset)[1]
N_variables = dim(dataset)[2]
dataset$Fumatrici <- factor(dataset$Fumatrici)

attach(dataset)
```


```{r}
kable(summary(dataset), caption="Summary of the variables in the dataset") %>%
  kable_styling()
```

We note that the variable Age of the mother has a minimum value equal to 0 meaning that there are some wrong data. So we first display the distribution of the age of the mothers. 

```{r}
ggplot(data = subset(dataset))+
  geom_bar(aes(x=Anni.madre),
           stat="count",
           col="khaki",
           fill="khaki")+
  labs(title="Distribution of the mother age",
       x="Age",
       y="absolute frequency")+
  scale_y_continuous(breaks = seq(0,1200,200))+
  scale_x_continuous(breaks = seq(0,46,2))+
  theme_classic()+
  theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))
```
We can assume that the minimum meaningfull age of a mother is 12 years. So we filter the dataset to remove the dirty data corresponding to mothers with an age smaller than 12. 

```{r}
dataset <- dataset %>% 
  filter(Anni.madre >=12)
```



## Next, we start analyzing the anthropometrics data (Weigth, Length and Cranial Circumference) of the newborns in the dataset. 
### First we report a table with the summary of the statistics of the anthropometrics variables

```{r pressure, echo=FALSE}
cv_funct <- function(x) {
  return(sd(x) / mean(x) * 100)
}

kurtosis_index <- function(x) {
  return(kurtosis(x)-3)
}


std.err <- function(x){
  return (sd(x)/sqrt(sum(!is.na(x))))
}


sub_data <- subset(dataset[c("Cranio","Peso","Lunghezza")])#,  Gestazione>=37)

statistics <- data.frame(
        
  quartile_1 = round(apply(sub_data, 2, function(x) quantile(x,probs=0.25, names=FALSE)),2),
                          
  median = round(apply(sub_data, 2, median),2),
                            
  quartile_3 = round(apply(sub_data, 2, function(x) quantile(x,probs=0.75, names=FALSE)),2),
                          
  mean = round(apply(sub_data, 2, mean),2),
                          
  sd = round(apply(sub_data, 2, sd),2),
                          
  cv = round(apply(sub_data, 2, cv_funct),2),
                          
  skewness = round(apply(sub_data, 2, skewness),2),
                          
  kurtosis = round(apply(sub_data, 2, kurtosis_index),2),
  
  std_err = round(apply(sub_data,2,std.err),2)
)


rownames(statistics) <- c("Cranial circumference (mm)", "Weight (gr)", "Length (mm)")

kable(statistics, caption="Indices of weight, length and cranial diameter") %>%
  kable_styling()
```


### Second we display the distribution of the anthropometric variables again a normal distribution

```{r}
# the function normalize takes in input a variable in string form and returns its values normalized by its mean and standard deviation
normalize <- function(x){
  return ((x-mean(x))/sd(x))
}

# we generate a standard normal distribution
gaussian_distribution <- rnorm(1000000,0,1)


# we plot the density of the different variables against a standard normal distribution
ggplot()+
  geom_density(aes(x=gaussian_distribution, fill="N01"), color=NA,  alpha=.8)+
  geom_density(aes(x=normalize(Peso), color="Weight"), linewidth=1.5)+
  geom_density(aes(x=normalize(Cranio), color="Cranial circumference"), linewidth=1.5)+
  geom_density(aes(x=normalize(Lunghezza), color="Length"), linewidth=1.5)+
  labs(title="Distribution of the variables",
       x="Normalized variable ",
       y="density")+
  scale_fill_manual(values = c(N01 = "grey"),   
                    labels = c(N01 = "N(0,1)")) +
  theme_classic()+
  theme(plot.title.position = "panel",
        plot.title=element_text(hjust=0.5))+
  guides(fill = guide_legend(override.aes = list(color = NA)))+
  labs(color = "Variabile",
      fill = "Distribuzione normale")
```

The plots as also the values of skeweness and Kurtosis show that:   

- All the anthropometric variables are more peaked than a normal distribution, so the probability of outliers is smaller that for the normal distribution with same mean and standard deviation.   

- Values in the left tails of the distributions are more likely than values in the right tail. This could be due to the data relative to premature births.   


To check if the asymmetry is due to the premature births we plot the distribution of births by gestational weeks.  


```{r}
ggplot(data=dataset)+
  geom_bar(aes(x=Gestazione),
           stat="count",
           col="pink",
           fill="pink")+
  labs(title="Distribution of gestational age at childbirth",
       x="gestational age (weeks)",
       y="absolute frequency")+
  scale_y_continuous(breaks = seq(0,750,50))+
  scale_x_continuous(breaks = seq(25,43,1))+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))
```


A delivery is considered premature if it occurs before 37 weeks of gestation. By looking at the barplot we note there exist different many premature birthds in the dataset. We expect the anthropometric measurements of premature newborns to be significantly smaller.  

So we filter the data relative only to the non-premature births and study their statistics. 


```{r}
regular_deliveries_data <- dataset %>% filter(Gestazione >= 37)

statistics_regular_deliveries <- data.frame(
        
  quartile_1 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.25, names=FALSE)),2),
                          
  median = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, median),2),
                            
  quartile_3 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.75, names=FALSE)),2),
                          
  mean = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, mean),2),
                          
  sd = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, sd),2),
                          
  cv = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, cv_funct),2),
                          
  skewness = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, skewness),2),
                          
  kurtosis = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, kurtosis_index),2),
  
  std_err = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")],2,std.err),2)
)

rownames(statistics_regular_deliveries) <- c("Cranial circumference (mm)", "Weight (gr)", "Length (mm)")

kable(statistics_regular_deliveries, caption="Indices of weight, length and cranial diameter of the non-premature births") %>%
  kable_styling()



# we plot the density of the different variables against a standard normal distribution
ggplot()+
  geom_density(aes(x=gaussian_distribution, fill="N01"), color=NA,  alpha=.8)+
  geom_density(aes(x=normalize(regular_deliveries_data$Peso), color="Weight"), linewidth=1.5)+
  geom_density(aes(x=normalize(regular_deliveries_data$Cranio), color="Cranial circumference"), linewidth=1.5)+
  geom_density(aes(x=normalize(regular_deliveries_data$Lunghezza), color="Length"), linewidth=1.5)+
  labs(title="Distribution of the anthropometric variables \n relative to the non-premature deliveries",
       x="Normalized variable ",
       y="density")+
  scale_fill_manual(values = c(N01 = "grey"),   
                    labels = c(N01 = "N(0,1)")) +
  theme_classic()+
  theme(plot.title.position = "panel",
        plot.title=element_text(hjust=0.5))+
  guides(fill = guide_legend(override.aes = list(color = NA)))+
  labs(color = "Variabile",
      fill = "Distribuzione normale")
```

The significantly smaller values of skeweness of all the anthropometric variables corresponding only to the non-premature births confirm that the asymmetry that we had previously observed was mainly due to the presence of premature births. Also the values of the kurtosis have significanlty decreased suggesting a much better normal approximation of the distribution of the anthropometric variables when filtering the data corresponding to the only non-premature deliveries. The length is the only variables that is not really normally distributed. The non-normality of the variable Lenght could be due to the presenze of some corrupted data. 
To this end we study the correlation between the anthropometric variables.


```{r}


cat("The correlation coefficiente of the variable Length with the varibles Weight and Cranial circumference in non premature newborns is equal to: \n", 
    "-cor(Length, Weight) = ", cor(regular_deliveries_data$Lunghezza, regular_deliveries_data$Peso)
 , "\n",
    "-cor(Length, Cranial circumference) = ", cor(regular_deliveries_data$Lunghezza, regular_deliveries_data$Cranio)
)
```
```{r, fig.width=10, fig.height=5}
scatt1 = ggplot(regular_deliveries_data, aes(x = Lunghezza, y = Peso)) + 
                geom_point()+
                theme_classic()+
                theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                      axis.title.y = element_text(size = 14, margin = margin(r = 10)))+
                labs(title="Scatterplot of Lenght against Weight",
                                        x=" Lenght(mm)",
                                        y=" Weight(gr)")+
                scale_y_continuous(breaks = seq(2000,5000,500))+
                scale_x_continuous(breaks = seq(300,600,50))


scatt2 = ggplot(regular_deliveries_data, aes(x = Lunghezza, y = Cranio)) + 
                geom_point()+
                theme_classic()+
                theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                      axis.title.y = element_text(size = 14, margin = margin(r = 10)))+
                labs(title="Scatterplot of Lenght against Cranial circumference",
                                        x=" Lenght(mm)",
                                        y=" Cranial circumference(mm)")+
                scale_x_continuous(breaks = seq(300,600,50))

comb_fig_scatt <- scatt1 + scatt2 + plot_layout(ncol=2, heights = c(5), widths= c(5,5))
print(comb_fig_scatt)
```
From the scatterplots, it can be seen that there is one outlier measurement reporting a length smaller than 350 mm, which could be due to an error in the dataset, such as a typo in the annotation of the newborn's length. 
So we decide to clean the dataset of regular newborns from this datapoint.

```{r}
regular_deliveries_data <- regular_deliveries_data %>% filter(Lunghezza >= 350)

statistics_regular_deliveries <- data.frame(
        
  quartile_1 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.25, names=FALSE)),2),
                          
  median = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, median),2),
                            
  quartile_3 = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, function(x) quantile(x,probs=0.75, names=FALSE)),2),
                          
  mean = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, mean),2),
                          
  sd = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, sd),2),
                          
  cv = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, cv_funct),2),
                          
  skewness = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, skewness),2),
                          
  kurtosis = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")], 2, kurtosis_index),2),
  
  std_err = round(apply(regular_deliveries_data[c("Cranio","Peso","Lunghezza")],2,std.err),2)
)

rownames(statistics_regular_deliveries) <- c("Cranial circumference (mm)", "Weight (gr)", "Length (mm)")

kable(statistics_regular_deliveries, caption="Indices of weight, length and cranial diameter of the non-premature births") %>%
  kable_styling()
```

We observe that now, in the cleaned dataset, also the variable Length follows approximately a normal distribution, with both kurtosis and skewness coefficients close to 0.


### Analysis of weight, length and cranial diameter of the sample with respect to the population statistics.

According to the study [International standards for newborn weight,
length, and head circumference by gestational age and
sex](https://doi.org/10.1016/S0140-6736(14)60932-6), the birthweight,
birthlength and birth head circumference of babies born in Italy from
the 37th week of gestation on, have the following mean and standard
deviation:

-   birthweight: Mean = 3.3, SD = 0.4,

-   birthlength: Mean = 49.4, SD = 1.7,

-   birth head circumference: Mean = 34.0, SD = 1.2

From the central limit theorem we know that given a random variable
X with mean $\mu_0$ and standard deviation $\sigma$, the mean over a
random sample of N people distributes as a random variable following a
normal distribution with mean $\mu_0$ and standard deviation $\sigma/N$.

So for the three variables weight, length and cranial circumference we
assume as null hypothesis (H0) that they have mean $\mu_0=$ 3300 gr,
49.4 cm and 34 cm and standard deviation $\sigma=$ 400gr, 1.7cm, 1.2cm,
respectively. Then we test this hypotheses using a Z-test. To this end
we consider the sub-dataset corresponding to the births after the 37th
week of gestation, having length $N$. 

Then we compute the Z-test value 
$$Z = \frac{mean(Variable)-\mu_0}{\sigma/\sqrt{N}}.$$


We compute the corresponfing p-values, i.e. the probability of getting a value equal or
smaller than mean(Variable) by randomly sampling N births. If the
p-value is smaller than the level of significance that we set to 0.05,
we reject the hypothesis H0.

```{r}
z_test = function(x, mu0, stdev, alfa){

  x = na.omit(x)
  mu_cap = mean(x)
  n = length(x)
  
  Z = (mu_cap - mu0) / (stdev / sqrt(n))
  valori.soglia = qnorm(c(alfa/2, 1-alfa/2))
  
  CI <- c(mu_cap+qnorm(alfa/2)*(stdev/sqrt(n)),
          mu_cap-qnorm(alfa/2)*(stdev/sqrt(n)))
    
  return(
    list(mean.sample= mu_cap,
         stat.test = Z,
         threshold.val = valori.soglia,
         pvalue = 2*pnorm(-abs(Z)),
         Int.Conf. = CI
         ))
}
```



```{r}
z_test_weight = z_test(regular_deliveries_data$Peso, 3300, 400, 0.05)

z_test_length = z_test(regular_deliveries_data$Lunghezza, 494, 17, 0.05)

z_test_cranial_diam = z_test(regular_deliveries_data$Cranio, 340, 12, 0.05)


cat("The p-values of length, weight and cranial circumference are all smaller than 0.05, so we reject the hypothesis H0 that the three variables are distributed according to the distribution of weight, length and cranial diameter of the italian population. \n", 
    "-p_value(Weight) = ", z_test_weight$pvalue , "\n",
    "-p_value(Length) = ", z_test_length$pvalue , "\n",
    "-p_value(Cranial diameter) = ", z_test_cranial_diam$pvalue )
```

In particular we display the Z-test values of the anthropometric variables against the level of significance of 0.05 of the Z-test.

```{r}
ggplot()+
  geom_density(aes(x=gaussian_distribution, fill="N01"), color=NA,  alpha=.8) +
  geom_vline(aes(xintercept=z_test_weight$stat.test, color="Weight"), linewidth=1.5) +
  geom_vline(aes(xintercept=z_test_length$stat.test, color="Length"), linewidth=1.5)+
  geom_vline(aes(xintercept=z_test_cranial_diam$stat.test, color="Cranial circumference"), linewidth=1.5)+
  geom_vline(aes(xintercept=z_test_length$threshold.val, color="Threshold p=0.05"),linewidth=1, linetype="dashed")+
  scale_color_manual(values= c("Weight"= "gold1", "Length"="gold4", "Cranial circumference"="gold3", "Threshold p=0.05"="red"))+
  scale_fill_manual(values = c(N01 = "grey"),   
                    labels = c(N01 = "N(0,1)")) +
  labs(title="Z values of weight, length and cranial circumference",
       x=" ",
       y=" ")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)),
        legend.position ="bottom")
```


### Analysis of the differences in the antrhopometric variables across the different between males and females.

First we display the distribution of newborns between the two sexes, which result quite uniform.



```{r}
sex_colors <- c("M" = "dodgerblue", "F" = "hotpink")  


ggplot(data=dataset)+
  geom_bar(aes(x=Sesso,
               fill=Sesso,
               color=Sesso),
           stat="count")+
  labs(title="Sex distribution of newborns",
       x="sex",
       y="absolute frequency",
       color="SEX:",
       fill="SEX:")+
  scale_fill_manual(values= sex_colors)+
  scale_y_continuous(breaks = seq(0,1500,100))+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))
```

Second we study the statistics of the anthropometric variables across the two sexes and display them via a boxplot. In order to test later differences across the sexes using a t.test it is important to compare normally distrubuted quantities. For this reason, we restrict our study only to the not-premature deliveries.


```{r}
cv_funct <- function(x) {
  return(sd(x) / mean(x) * 100)
}


conditional_dataset_sex <- regular_deliveries_data %>% 
  group_by(Sesso) %>% 
  summarise(mean_peso = round(mean(Peso),2),
            sd_peso = round(sd(Peso),2),
            cv_peso = round(cv_funct(Peso), 2),
            skew_peso = round(skewness(Peso), 2),
            kurt_peso = round(kurtosis_index(Peso), 2),
            
            mean_lunghezza = round(mean(Lunghezza),2),
            sd_lunghezza = round(sd(Lunghezza),2),
            cv_lunghezza = round(cv_funct(Lunghezza), 2),
            skew_lunghezza = round(skewness(Lunghezza), 2),
            kurt_lunghezza = round(kurtosis_index(Lunghezza), 2),
            
            mean_cranio = round(mean(Cranio),2),
            sd_cranio = round(sd(Cranio),2),
            cv_cranio = round(cv_funct(Cranio), 2),
            skew_cranio = round(skewness(Cranio), 2),
            kurt_cranio = round(kurtosis_index(Cranio), 2))


colnames(conditional_dataset_sex) <- c("sex", "mean", "sd", "cv", "Skew", "kurt" , "mean", "sd", "cv" , "Skew", "kurt",  "mean", "sd", "cv", "Skew", "kurt")

kable(conditional_dataset_sex, caption="Indices of weight, length and cranial diameter of non-premature newborns, divided by sex") %>%
  kable_styling(bootstrap_options = "condensed", full_width = TRUE) %>%
  add_header_above(c(" "=1, "Weight" = 5, "Length" = 5, "Cranial circumference" = 5)) %>%
  row_spec(1, background = "pink")%>%
  row_spec(2, background = "lightblue") %>%
  column_spec(c(1,6,11), border_right = TRUE)
```



```{r, fig.width=15, fig.height=6}

box_length_sex <- ggplot(data=regular_deliveries_data,
                     aes(x=Sesso,
                        y=Lunghezza,
                        fill=Sesso),
                        color=Sesso)+
                     geom_boxplot()+
                     labs(title="Length ditribution according to sex\n of non premature newborns",
                          x=" ",
                          y="length (cm)")+
                     scale_fill_manual(values= sex_colors)+
                     theme_classic()+
                     theme(plot.title.position = "panel",
                           plot.title = element_text(size=16, hjust=0.5),
                           axis.title.x = element_text(size=14),
                           axis.title.y = element_text(size=14),
                           axis.text.x = element_text(size=14),
                           axis.text.y = element_text(size=10),
                           legend.position = "none")+
                      scale_y_continuous(breaks = seq(310, 560, by = 25))+
                      scale_x_discrete(labels=c("M"="Male", "F"="Female"))


box_weight_sex <- ggplot(data=regular_deliveries_data,
                     aes(x=Sesso,
                        y=Peso,
                        fill=Sesso),
                        color=Sesso)+
                     geom_boxplot()+
                     labs(title="Weight ditribution according to sex\n of non premature newborns",
                          x=" ",
                          y="weight (Kg)")+
                     scale_fill_manual(values= sex_colors)+
                     theme_classic()+
                     theme(plot.title.position = "panel",
                           plot.title = element_text(size=16, hjust=0.5),
                           axis.title.x = element_text(size=14),
                           axis.title.y = element_text(size=14),
                           axis.text.x = element_text(size=14),
                           axis.text.y = element_text(size=10),
                           legend.position = "none")+
                      scale_y_continuous(breaks = seq(1000, 5000, by = 400),
                                        labels = seq(1.0, 5.0, by = 0.4))+
                      scale_x_discrete(labels=c("M"="Male", "F"="Female"))


box_head_sex <- ggplot(data=regular_deliveries_data,
                     aes(x=Sesso,
                        y=Cranio,
                        fill=Sesso),
                        color=Sesso)+
                     geom_boxplot()+
                     labs(title="Cranial diameter ditribution according to sex\n of non premature newborns",
                          x=" ",
                          y="diameter (cm)")+
                     scale_fill_manual(values= c("M"="dodgerblue", "F"="hotpink"))+
                     theme_classic()+
                     theme(plot.title.position = "panel",
                           plot.title = element_text(size=16, hjust=0.5),
                           axis.title.x = element_text(size=14),
                           axis.title.y = element_text(size=14),
                           axis.text.x = element_text(size=14),
                           axis.text.y = element_text(size=10),
                           legend.position = "none")+
                      scale_y_continuous(breaks = seq(250, 400, by = 25))+
                      scale_x_discrete(labels=c("M"="Male", "F"="Female"))


comb_fig_antropometrics <- box_head_sex + box_weight_sex + box_length_sex + plot_layout(ncol=3, heights = c(5), widths= c(5,5,5))
print(comb_fig_antropometrics)
```
The data suggest a difference in the anthropometric variables across the two sexes, with higher values for all the variables in case of males rather than females. 

### We test the hypothesis that the antropometric values of males and females are significantly different

In order to perform a meaningfull t-test that compares the distributions of anthropometric measurements of males and females we first have to check that the distributions of the antrhopometric measurements of males and females are approximatelly normal and have similar variances. 
The previous table suggest that this should be true if we restrict our study only to the non-premature deliveries. 
We perform Q-Q tests and Shapiro Wilk tests to verify that the three variables distribute according to normal distributions for both sexes. We recall the Q-Q plots display the observed quantiles of the variables against the quantiles of a Gaussian distribution. So, if the variables come from a normal distribution, the points in the Q-Q plot will approximately lie along a straight line.


```{r, fig.width=10, fig.height=15}

males <- regular_deliveries_data %>%
  filter(Sesso=="M")

females <- regular_deliveries_data %>%
  filter(Sesso=="F" & Lunghezza>400)


## QQ-plots of the length
QQ_length_males = ggplot(males, aes(sample = Lunghezza)) +
  labs(title="Q-Q Plot of the length\n of male newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "dodgerblue") +
  stat_qq_line(color="red") +
  theme_classic()

QQ_length_females = ggplot(females, aes(sample = Lunghezza)) +
  labs(title="Q-Q Plot of the length\n of female newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "hotpink") +
  stat_qq_line(color="black") +
  theme_classic()

## QQ_plots of the weight
QQ_weight_males = ggplot(males, aes(sample = Peso)) +
  labs(title="Q-Q Plot of the weight\n of male newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "dodgerblue") +
  stat_qq_line(color="red") +
  theme_classic()

QQ_weight_females = ggplot(females, aes(sample = Peso)) +
  labs(title="Q-Q Plot of the weight\n of female newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "hotpink") +
  stat_qq_line(color="black") +
  theme_classic()

QQ_cranial_males = ggplot(males, aes(sample = Cranio)) +
  labs(title="Q-Q Plot of the cranial circumference\n of male newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "dodgerblue") +
  stat_qq_line(color="red") +
  theme_classic()

QQ_cranial_females = ggplot(females, aes(sample = Cranio)) +
  labs(title="Q-Q Plot of the cranial circumference\n of female newborns",
            x="Theoretical Quantiles",
            y="Observed Quantiles")+
  stat_qq(color = "hotpink") +
  stat_qq_line(color="black") +
  theme_classic()

comb_QQ <- QQ_length_males + QQ_length_females +
                  QQ_weight_males + QQ_weight_females +
                  QQ_cranial_males + QQ_cranial_females +
                  plot_layout(ncol=2, heights = c(5,5,5), widths= c(5,5))

print(comb_QQ)


shaptestLM = shapiro.test(males$Lunghezza)
shaptestLF = shapiro.test(females$Lunghezza)
shaptestWM = shapiro.test(males$Peso)
shaptestWF = shapiro.test(females$Peso)
shaptestCM = shapiro.test(males$Cranio)
shaptestCF = shapiro.test(females$Cranio)

cat("The maximum Shapiro-Wilk statistics of the 6 normality tests about distribution of weight, length and cranial circumference across the two sexes is:", min(shaptestLM$statistic, shaptestLF$statistic, shaptestWM$statistic, shaptestWF$statistic, shaptestCM$statistic, shaptestCF$statistic), "\n")

cat("The minimum Shapiro-Wilk p.value of the 6 normality tests about distribution of weight, length and cranial circumference across the two sexes is:", min(shaptestLM$p.value, shaptestLF$p.value, shaptestWM$p.value, shaptestWF$p.value, shaptestCM$p.value, shaptestCF$p.value))
```

From the Q-Q plots and the Shapiro-Wilk tests we can argue that the anthropometric variables, even if not exactly distributed according to normal distributions, are really close to normal distributions. So we can perform a t.test to verify the null hypothesis that the distribution followed by males and females is the same.

```{r}
ttest_length = t.test(males$Lunghezza, females$Lunghezza,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_weight = t.test(males$Peso, females$Peso,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_cranium = t.test(males$Cranio, females$Cranio,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

cat("The p-values of the t-tests about length, weight and cranial circumference distribution comparison between males and females are: \n", 
    "-p_value(Weight) = ", ttest_weight$p.value , "\n",
    "-p_value(Length) = ", ttest_length$p.value , "\n",
    "-p_value(Cranial diameter) = ", ttest_cranium$p.value )

```
The very small p-values of the three t.tests show that there is a significant statistical difference in the mean of the anthropometric variables between males and females. In particular we can refuse the null hypothesis H0 and assert that the distributions of the anthropometric measurements followed by males and females are different.



## Analysis of cesarean deliveries

Now we analyze the number of cesarean deliveries across the different hospitals.
First we study the the absolute and relative frequency of each hospital. 

```{r}
#create a conditional dataset with the number of deliveries and cesarean deliveries per hospital
conditional_dataset_ospedale <- 
  dataset %>% 
  group_by(Ospedale) %>%  
  summarise(cesareo = sum(Tipo.parto == "Ces", na.rm = TRUE),
            naturale = sum(Tipo.parto == "Nat", na.rm = TRUE),
            parti= sum())
```

```{r}
Freq_ass_hosp = table(dataset["Ospedale"])
Freq_rel_hosp = table(dataset["Ospedale"])/N_observations
Freq_hosp = cbind(Freq_ass_hosp,Freq_rel_hosp)
rownames(Freq_hosp) <- c("Hospital1","Hospital2","Hospital3")
colnames(Freq_hosp) <- c("Absolute frequency","Relative frequency")
kable(t(Freq_hosp), caption = 'Deliveries distribution across hospitals') %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
We have observed that the distribution of the deliveries across the three hospitals is essentially uniform.

Next we use a boxplot to display the percentage of cesarean deliveries in each hospital.

```{r, fig.width=10, fig.height=5}

hospitals=c("osp1" = "Hospital1", "osp2"="Hospital2", "osp3"="Hospital3")

dataset$Tipo.parto <- factor(dataset$Tipo.parto, levels = c("Nat", "Ces"))

bar_cesarean_hospital =  ggplot(data=dataset)+
                                geom_bar(aes(x=Ospedale,
                                             fill=Tipo.parto),
                                         position="fill")+
                                labs(title="Percentage of cesarean childbirths by hospitals",
                                     x="",
                                     y="Percentage",
                                     fill="type of delivery:")+
                                scale_fill_manual(values= c("Ces"="red", "Nat"="ivory2"), labels = c("Ces"= "Cesarean", "Nat"="Natural"))+
                                theme_classic()+
                                theme(plot.title.position = "panel",
                                      plot.title = element_text(size=14, hjust=0.5),
                                      axis.title.x = element_text(size=12),
                                      axis.title.y = element_text(size=12),
                                      axis.text.x = element_text(size=10),
                                      axis.text.y = element_text(size=10),
                                      legend.position = "bottom",
                                      legend.title = element_text(size = 12),
                                      legend.text = element_text(size = 12))+
                                 scale_y_continuous(labels = percent_format(),
                                           breaks = seq(0, 1, by = 0.1))+
                                 scale_x_discrete(labels=hospitals)


print(bar_cesarean_hospital)
```
The boxplot show that in all the three hospitals the percentage of cesarean deliveries is approximately of 30%.  In particular, the number of cesarean deliveries seems to be marginally lower in the Hospital3 with respect to Hospital1 and Hospital2.


### Next we test the null hypothesis that the distribution of cesarean deliveries is the same and so uniform across the three hospitals

We first use a Pearson’s Chi-squared test to examine the hypothesis that
the proportion of cesarean deliveries is equal across the hospitals. It
compares the observed frequencies of cesarean and natural deliveries in
each hospital with the expected frequencies calculated under the null
hypothesis of uniform distribution of cesarean deliveries across
hospitals, literally:
$$Expected\_Freq\_Ces(Hosp) = Prob(Cesarean)* Number\_Deliveries(Hosp) $$
where $Prob(Cesarean)=N\_tot(Cesarean)/N\_tot(deliveries)$. The test
assesses whether the observed differences are statistically significant
by providing a chi-squared statistic and a p-value to support or reject
the null hypothesis.

```{r}
# create contingency table bewtween hospitals and kind of delivery
table_delivery_hospital = table(Ospedale, Tipo.parto)

#The Pearson’s Chi-squared test is performed 
chisq_test = chisq.test(table_delivery_hospital)
cat("The p-value of the Pearson's Chi-squared test is" , chisq_test$p.value)
```

<!-- # dataset$Tipo.parto.num = ifelse(Tipo.parto=="Nat",1,0) -->
<!-- #  -->
<!-- #  -->
<!-- # pairwise.t.test(dataset$Tipo.parto.num, Ospedale, -->
<!-- #                 paired = F, -->
<!-- #                 pool.sd = T, -->
<!-- #                 p.adjust.method = "none") -->

Based on the p-value of the Pearson's Chi-squared we can not reject the null-hypothesis.

The test Chi-squared that we have performed is based on the assumption
that for sufficiently many observations the distribution of the
frequencies of cesarean deliveries at each hospital is close to a normal
distribution, because of the central limit theorem. However, in this
case, we know that the number of cesarean deliveries at the $i$-th
hospital follows a binomial distribution $B(p_i)$. So we can also
perform a binomial test for each hospital to examine the hypothesis that
$p_i=Prob(Cesarean)$ for each hospital. $Prob(Cesarean)$ as defined
above is assumed as an accurate approximation of probability of cesarean delivery if the distribution is uniform across the hospitals, in particular since the distribution of deliveries per hospital is quite uniform there is no need to use a weighted formula for the probability $Prob(Cesarean)$. We report the the $p$-value and the confidence interval of this test for each hospital.

```{r}
# first we create a table with the total number of deliveries and cesarean deliveries observed at each hospital.

sub_dataset = subset(dataset, Ospedale=="osp1")
N_ces_hosp = sum(sub_dataset$Tipo.parto=="Ces")
N_deliveries_hosp = nrow(sub_dataset)

Freq_ces_hosp = table(Ospedale[Tipo.parto=="Ces"])
Ces_hosp = cbind(Freq_ass_hosp,Freq_ces_hosp)
Ces_hosp = as.data.frame(Ces_hosp)

rownames(Ces_hosp) <- c("Hospital_1", "Hospital_2", "Hospital_3")

# Next we compute Prob(cesarean) that should be a good approximation of the probability of oberving a cesarean delivery assuming this is uniform across the three hospitals.

prob_cesarean = sum(Tipo.parto=="Ces")/N_observations

# Last we perform a binomial test for each hospital
p_values <- numeric(nrow(Ces_hosp))
conf_int <- numeric(nrow(Ces_hosp))

for(i in seq_len(nrow(Ces_hosp))){
  test <- binom.test(Ces_hosp$Freq_ces_hosp[i], Ces_hosp$Freq_ass_hosp[i], p = prob_cesarean, alternative = "two.sided")
  p_values[i] <- test$p.value
  conf_int[i] <- paste0("(", round(test$conf.int[1],2), ",", round(test$conf.int[2],2),")")  
}

Ces_hosp$p_value <- p_values
Ces_hosp$conf_int <- conf_int

kable(t(Ces_hosp[, c("p_value", "conf_int")]), caption = 'P-values and confidence intervals for the binomial outcomes of each hospital,1n calculated under the assumption of a binomial distribution with a uniform probability\n of cesarean deliveries across all hospitals.') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), position="left")
```
Also in this case the p-value of the binomial tests do not allow to reject the null hypothesis asserting that the number of deliveries across the three hospitals is uniform.


## Next we analyze the frequency of smoking mothers.


```{r}
Freq_rel_smoker = sum(dataset$Fumatrici == 1)/N_observations

cat("The percentage of smoking mothers is:", paste0(Freq_rel_smoker,"%"))
```
Moreover we investigate if the smoke influence some of the other variables, including weight, length and Cranial circumference and gestational age.

```{r, fig.width=10, fig.height=10}

boxplot_smoking_1= ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Peso,
                               fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Weight distribution",
                                x=" ",
                                y="weight (kg)")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(labels = seq(1,5,0.5),
                                                     breaks = seq(1000, 5000, by = 500))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))


boxplot_smoking_2 = ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Lunghezza,
                              fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Length distribution",
                                x=" ",
                                y="Length (cm)")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(breaks = seq(300,600,50),
                                                    labels = seq(30, 60, 5))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))


boxplot_smoking_3 = ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Cranio,
                              fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Cranial circumference distribution",
                                x=" ",
                                y="Cranial circumference (cm)")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(labels = seq(25,40,5),
                                                     breaks = seq(250, 400, by = 50))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))



boxplot_smoking_4 = ggplot(data=dataset,
                           aes(x=Fumatrici,
                               y=Gestazione,
                              fill=Fumatrici))+
                           geom_boxplot()+
                           labs(title="Gestational time distribution",
                                x=" ",
                                y="Gestational time")+
                           scale_fill_manual(values= c("0"="green", "1"="grey"), 
                                             labels = c("0"= "Non-smoker", "1"="Smoker"))+
                           theme_classic()+
                           theme(plot.title.position = "panel",
                                 plot.title = element_text(size=16, hjust=0.5),
                                 axis.title.x = element_text(size=14),
                                 axis.title.y = element_text(size=14),
                                 axis.text.x = element_text(size=14),
                                 axis.text.y = element_text(size=10),
                                 legend.position = "none")+
                                 scale_y_continuous(breaks = seq(26, 42, by = 2))+
                                 scale_x_discrete(labels=c("Non-smoker", "Smoker"))

comb_smoke <- boxplot_smoking_1 + boxplot_smoking_2 + boxplot_smoking_3 + boxplot_smoking_4 +    plot_layout(ncol=2, heights = c(5,5), widths= c(5,5))

print(comb_smoke)

```
The boxplot may suggest a difference in the measurements and gestational time between smoking and non smoking mothers. We test this hypothesis using a t.test.


```{r}
Smokers_data <- dataset %>%
  filter(Fumatrici=="1")

Not_smokers_data <- dataset %>%
  filter(Fumatrici=="0")

ttest_smoke_weight = t.test(Smokers_data$Peso, Not_smokers_data$Peso,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_smoke_length = t.test(Smokers_data$Lunghezza, Not_smokers_data$Lunghezza,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 


ttest_smoke_cranium = t.test(Smokers_data$Cranio, Not_smokers_data$Cranio,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

ttest_smoke_gestation = t.test(Smokers_data$Gestazione, Not_smokers_data$Gestazione,  
                 alternative = "two.sided", 
                 mu = 0, 
                 var.equal = TRUE,
                 conf.level = 0.95,
                 paired = FALSE) 

cat("The p-values of the t-tests about length, weight, cranial circumference and gestation time distribution comparison between smoking and non-smoking mothers are: \n", 
    "-p_value(Weight) = ", ttest_smoke_weight$p.value , "\n",
    "-p_value(Length) = ", ttest_smoke_length$p.value , "\n",
    "-p_value(Cranial diameter) = ", ttest_smoke_cranium$p.value, "\n",
    "-p_value(Cranial diameter) = ", ttest_smoke_gestation$p.value)
```
The t.tests do not allow us to reject the hypothesis that the gestational time and antropometric measurments of newborns are statistically different according to smoking and not-smoking mothers.


## Next we analyze the number of deliveries and relation with the age of the mothers.

First we explore the variable number of deliveries.

```{r}
Freq_abs_N.deliveries = table(dataset["N.gravidanze"]) 
Freq_rel_N.deliveries = table(dataset["N.gravidanze"])/N_observations

Freq_N.deliveries=data.frame(N.gravidanze = names(Freq_abs_N.deliveries),
                             Freq_abs = as.numeric(Freq_abs_N.deliveries),
                             Freq_rel = paste(round(Freq_rel_N.deliveries,2),"%"))

Freq_N.deliveries$N.gravidanze <- factor(Freq_N.deliveries$N.gravidanze, 
                                        levels=as.character(sort(as.numeric(unique(Freq_N.deliveries$N.gravidanze)))))


ggplot(data=Freq_N.deliveries)+
  geom_bar(aes(x=N.gravidanze, y=Freq_abs),
           stat="identity",
           col="aquamarine",
           fill="aquamarine")+
  labs(title="Number of previous pregnancies",
       x="",
       y="absolute frequency")+
  scale_y_continuous(breaks = seq(0,1200,200))+
  geom_text(aes(x=N.gravidanze, y=Freq_abs+50, label = Freq_rel))+
  theme_classic()+
  theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
        axis.title.y = element_text(size = 14, margin = margin(r = 10)))
```

We observe that most of deliveries correspond to first deliveries and that less than 10% of the mothers have 4 children or more.


Next we investigate if there is a correlation between the number of children and the age of the mother. To this end we split the dataset in classes according to the Number of deliveries and years of the mother and we display the dstribution according to the cross classes.

```{r}
dataset$N.gravidanze_CL <- cut(dataset$N.gravidanze, 
                     breaks=c(0,1,2,3,100), 
                     right=FALSE, 
                     labels=c("0", "1","2","3+"))


dataset$Anni.madre_CL <- cut(dataset$Anni.madre, 
                     breaks=c(11,20,26,32,38,46))

dataset$N.gravidanze_CL <- cut(dataset$N.gravidanze, 
                     breaks=c(0,1,2,3,100), 
                     right=FALSE, 
                     labels=c("0","1","2","3+"))

cont_table_year_N.delivery = table(dataset$Anni.madre_CL, dataset$N.gravidanze_CL)

ggpubr::ggballoonplot(data=as.data.frame(cont_table_year_N.delivery),
                      fill="khaki")
```

<!-- #  -->
<!-- #  -->
<!-- # freq_ass_age <- table(dataset$Anni.madre_CL) -->
<!-- #  -->
<!-- # count <- sum((dataset$Anni.madre > 10) & (dataset$Anni.madre <= 46)) -->
<!-- #  -->
<!-- # freq_rel_age <- round(freq_ass_age/count,3) -->
<!-- #  -->
<!-- # freq_age=as.data.frame(cbind(freq_ass_age, freq_rel_age)) -->
<!-- #  -->
<!-- # kable(freq_age, caption="Distribution of the deliveries for classes of age") %>% -->
<!-- #   kable_styling(bootstrap_options = c("striped", "hover"), position="center") -->


<!-- ggplot(data = subset(dataset, !is.na(Anni.madre_CL)))+ -->
<!--   geom_bar(aes(x=Anni.madre_CL), -->
<!--            stat="count", -->
<!--            col="khaki", -->
<!--            fill="khaki")+ -->
<!--   labs(title="Distribution of the mother age", -->
<!--        x="Age", -->
<!--        y="absolute frequency")+ -->
<!--   scale_y_continuous(breaks = seq(0,1200,200))+ -->
<!--   theme_classic()+ -->
<!--   theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)), -->
<!--         axis.title.y = element_text(size = 14, margin = margin(r = 10))) -->


As expected we observe that an increasing number of children usually correspond to older mothers. We test the hypothesis the the distribution of number of childer is not uniform across the ages via a Pearson chi-squared test.

```{r}
chisq_test = chisq.test(cont_table_year_N.delivery)
cat("The p-value of the Pearson's Chi-squared test is" , chisq_test$p.value) 
```

The Pearson's Chi squared test confirms the hypothesis that the distribution of number of deliveries per age is not uniform. 

## Creation of a regression model for the weight of the newborn.
We start studying the correlations between the different variables and the variable weight. First we display the correlations between the different quantitative variable and the variable weight.

```{r}
# panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
# {
#   #usr <- par("usr"); on.exit(par(usr))
#   #par(usr = c(0, 1, 0, 1))
#   r <- (cor(x, y))
#   txt <- format(c(r, 1), digits = digits)[1]
#   txt <- paste0(prefix, txt)
#   if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
#   text(0.5, 0.5, txt, cex = 1.5)
# }

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  r <- cor(x, y, use = "complete.obs")
  txt <- format(c(r, 1), digits = digits)[1]
  txt <- paste0(prefix, txt)
  
  usr <- par("usr") # leggi coordinate correnti
  x_pos <- mean(usr[1:2])  # centro asse x
  y_pos <- mean(usr[3:4])  # centro asse y
  
  if(missing(cex.cor)) cex.cor <- 10/strwidth(txt)
  text(x_pos, y_pos, txt, cex = 1.5, ...)
}


#correlazioni
pairs(dataset[, c("Anni.madre","N.gravidanze", "Gestazione", "Peso", "Lunghezza", "Cranio")], lower.panel=panel.cor, upper.panel=panel.smooth)
```

The correlation matrix suggest a medium to high correlation between Weight and Length/Cranium circumference and gestation time. Differently, the weight seems not to be correlated to the number of deliveries and age of the mother. Additionally we have already observed that there is not a significant correlation between the weight of the newborn and the fact that the mother is a smoker. Differently we have observed a strong correlation between the sex of the newborn and its weight. 

The scatter plots also suggest a possible nonlinear dependance of the weight of the newborns with respect to gestation time and possibly Length and cranium circumference.

Even if this is not observable from the correlation matrix, the literature suggest that there could be a difference in the weight of the newborns between the first child and the subsequent ones (see [International standards for newborn weight, length, and head circumference by gestational age and sex](https://doi.org/10.1016/S0140-6736(14)60932-6)), for this reason we split the dataset according to the Number of delivery classes "0" and "1+" and use this as a qualitative variable in the investigation of a good regression model fo the weight of the newborns.

```{r}
dataset$N.gravidanze_CL <- cut(dataset$N.gravidanze, 
                     breaks=c(0,1,100), 
                     right=FALSE, 
                     labels=c("0","1+"))
```


<!-- ggplot(data=dataset, -->
<!--        aes(x=N.gravidanze_CL, -->
<!--            y=Peso, -->
<!--            fill=N.gravidanze_CL))+ -->
<!--        geom_boxplot(color="black")+ -->
<!--        labs(title="Weight distribution according to Number of deliveries", -->
<!--             x=" ", -->
<!--             y="weight (kg)")+ -->
<!--        scale_fill_manual(values= c("0"="aquamarine", "1+"="aquamarine3"))+ -->
<!--        theme_classic()+ -->
<!--        theme(plot.title.position = "panel", -->
<!--              plot.title = element_text(size=16, hjust=0.5), -->
<!--              axis.title.x = element_text(size=14), -->
<!--              axis.title.y = element_text(size=14), -->
<!--              axis.text.x = element_text(size=14), -->
<!--              axis.text.y = element_text(size=10), -->
<!--              legend.position = "none")+ -->
<!--              scale_y_continuous(labels = seq(1,5,0.5), -->
<!--                                  breaks = seq(1000, 5000, by = 500))+ -->
<!--              scale_x_discrete(labels=c("1st", "2nd", "3rd+")) -->



<!-- ```{r} -->
<!-- pairwise.t.test(dataset$Peso, dataset$N.gravidanze_CL, p.adjust.method = "bonferroni")  -->
<!-- ``` -->


We investiagte the best linear regression model using a stepwise selection procedure. We start from the full model from which we omit the variables Hospital and Kind of delivery since there is no reason why these variables should influence the weight of the newborn. Then we subtract or add back the variables that seems less influential. Finally we select the best model based on the BIC metrics.  

```{r}
mod_tot<-lm(Peso ~ Anni.madre + N.gravidanze_CL + Fumatrici + Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod_tot)
```
The variables "age", "N.deliveries" and "smoking" seeem to be the less significant. So we start removing the variable "age" from the model.

```{r}
mod1<-lm(Peso ~  N.gravidanze_CL + Fumatrici + Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod1)
```
cat("After removing the variable \"age\" the variables \"N.deliveries\" and \"smoking\" are still not significant.\n" )
cat("So we remove the variable \"smoking\" from the model.\n" )


```{r}
mod2<-lm(Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod2)
```


After removing the variable smoking the variables N.deliveries is still little significant.
So we remove also this variable from the model.


```{r}
mod3<-lm(Peso ~ Gestazione + Lunghezza + Cranio + Sesso, data=dataset)
summary(mod3)
```


In the final model all the remaining variables seem very significant so we don't remove any further variable.
We also note that the adjusted R-squared has remained unchanged after removing the variables "age" and "smoking" while has only slightly reduced when removing the variable "N.deliveries".
Next we test the different models that we have considered by looking at the analysis of variance between each model and the subsequent one and then we look at the BIC value of the four models.


```{r}
kable(anova(mod_tot, mod1, mod2, mod3), caption="Analysis of Variance of each model with respect to the previous one") %>%
  kable_styling()
```


The analysis of variance between each model and the subsequent one shows that every time we have removed a variable the residual sum of squares has reduced, however the reduction has been statistically significant only once that we have removed all the three variables.


```{r}
kable(BIC(mod_tot, mod1, mod2, mod3), caption="Bayesian Information Criterion (BIC) of the four models considered") %>%
  kable_styling()
```


Including only linear terms in the variables we find that the best model, according to the BIC values, is the one that predicts the weight of the newborn based only on the variables Length, Cranial circumference, sex and gestation time, followed closely by the model that included also the dependence upon the number of deliveries.

Next we start from the best model that we found and consider if to include also nonlinear terms in the variables. 
In particular we note that the the weigth should be proportional to the volume of the body and that the volume likely depends on cubic terms in the variables length and cranial circumferece. Indeed if we think about the body as a cylinder or parallelepipedon we could write its volume in terms of $length\times(cranial\_circumference)^2$ or $length^2\times cranial\_circumference$. 

Additionally the scatterplots suggest a possible nonlinear dependence of the weight on the gestational age. As a first step we study the correlation of some nonlinear terms between each other and with the variable weight. 

```{r}
dataset$Gestazione2 <- (dataset$Gestazione)^2
dataset$logGestazione <- log(dataset$Gestazione)

pairs(dataset[, c("Peso", "Gestazione", "Gestazione2",
                "logGestazione")], lower.panel=panel.cor, upper.panel=panel.smooth)

dataset$Lunghezza2 <- (dataset$Lunghezza)^2
dataset$Lunghezza2xCranio <- (dataset$Lunghezza)^2*(dataset$Cranio)
dataset$Cranio2 <- (dataset$Cranio)^2
dataset$LunghezzaxCranio2 <- dataset$Lunghezza*(dataset$Cranio)^2

pairs(dataset[, c("Peso","Lunghezza", "Lunghezza2", "Lunghezza2xCranio", 
                "Cranio", "Cranio2", "LunghezzaxCranio2")], lower.panel=panel.cor, upper.panel=panel.smooth)
```

Due to the realtively small range intervals of the variables we note that the correlation between all the variables Gestation time, Length and Cranial circumference with their quadratic version, as also between gestation time and its logarithm, is 1 up to a second error approximation. So it doesn't make sense to include these terms in the model since we can not expect them to perform significantly better than the linear terms.

The situation is slightly different for the cubic terms mixed in length and cranial circumference. Indeed, even if they have high correlations with the variables length and cranial circumference they also have higher correlations with the variable weight with respect to the linear terms. So we try to include these terms in the model.


```{r}
tot_non_lin_mod <- lm(Peso ~ N.gravidanze_CL + Gestazione + Lunghezza + Cranio + Lunghezza2xCranio + LunghezzaxCranio2 + Sesso, data=dataset)

nonlin_mod = MASS::stepAIC(tot_non_lin_mod, direction="both", k=log(N_observations))

summary(nonlin_mod)

kable(BIC(mod3, nonlin_mod), caption="Bayesian Information Criterion (BIC) of the linear model vs the nonlinear one") %>%
  kable_styling()

```
The model selected by the stepwise method by taking into account for all the possible nonlinear terms that we have previously considered as a Bayesian Information Criterion (BIC) value significantly smaller than the best model that is linear in all the variables. 

Next we compute the variance inflation factors of the selected model. 

```{r}

kable(car::vif(nonlin_mod), caption="Variance Inflation factors of the nonlinear model") %>%
  kable_styling()

```
As expected the model has extremely high variance inflaction factor due to the high correlation coefficients between the variables. Since the high variance inflation factors make the coefficient estimates unstable and unreliable, we prefer to discard the last model.

However we have learnt that the inclusion of nonlinear terms can improve the model. So, next, we try to update the last model by removing variables that are highly correlated to see if we can get a model that is more accurate than the linear model mod3 without having high variance inflation factors. In particular we highlight that a model expressing the weight linearly with respect to a volume would be physically much more meaningfull than one espressing the weight linearly with respect to lengths. For these reasons we opt for removing from the model the two linear variables "length" and "cranial_circumference"

```{r}
nonlin_mod2 <- update(nonlin_mod, . ~ . - Lunghezza - Cranio)

summary(nonlin_mod2)

kable(car::vif(nonlin_mod2), caption="Variance Inflation factors of the nonlinear model") %>%
  kable_styling()
```

After removing the variables length and cranial circumference the variance inflation factor is still high for the two cubic variables. For this reason we remove from the model also the variable $Lunghezza\times Cranio^2$ which results the less influential. 

```{r}
nonlin_mod3 <- update(nonlin_mod2, . ~ . - LunghezzaxCranio2)

summary(nonlin_mod3)

kable(car::vif(nonlin_mod3), caption="Variance Inflation factors of the nonlinear model") %>%
  kable_styling()

kable(BIC(mod2, nonlin_mod, nonlin_mod2, nonlin_mod3), caption="Bayesian information factor of the linear mode vs nonlinear ones") %>%
  kable_styling()


```

The final model predicts the weight of the new born based on gestation time, squared length times cranium circumference, sex and number of deliveries. The model has a BIC value larger than the best model which included all the possible nonlinear terms but smaller than the best model that included only linear terms in the variables. Moreover, differently from the best model that included all the possible nonlinear terms, the model "nonlin_mod3" has all the variance inflation factors smaller than 5.

We point out that the model "nonlin_mod3", including the mixed term $length^2\times cranial_circumferece$ but not the linear terms in the variables $length$ ad $cranial_circumferece$, may break the principle of marginality. However we highlight that the model is also physically more meaningfull that one including the linear terms in the variables $length$ and $cranial_circumference$.
For these reasons we accept this model and analyze it.
We start analyzing its coefficients 

```{r}
kable(coef(nonlin_mod3), caption="Coefficients of final nonlinear model") %>%
  kable_styling()
```
The coefficients say that for each unit increment in the gestation time (1 week) we observe an increment of approximately 40gr in the weigth of the newborn. Similarly for each 1mm^3 increment in the volume term ($Length^2 \times Cranial_circfumference$) we register an increment of $3.5*10^-5$ (gr) in the weight of the newborn. Finally we observe that the males will register approximately 7gr more than females and child after the first one weight approximately 4gr more than a first child.

Next we analyze the residuals, outliers and leverages of the model.


```{r}
par(mfrow=c(2,2))
plot(nonlin_mod3)

lmtest::bptest(nonlin_mod3)
lmtest::dwtest(nonlin_mod3)
shapiro.test(nonlin_mod3$residuals)
```

The analysis shows that there is no autocorrelation of the residuals (p-value of the Durbin-Watson test > 0.05). However we observe that the residuals are not normally distributed (p-value of the Shapiro-Wilk test <0.05) and that there is no heteroscedasticity (p-value of the Breusch-Pagan < 0.05). Additionally we observe that there are some highly influential outlier datapoints having Cook distance close to 0.5.

So we investigate deeper the leverages and outliers. We consider leverages the points having leverage value larger than the threshold. This is set equal to 3 times the leverage mean, i.e. the number of parameters of the model over the number of observations in the dataset. Moreover the outliers are identified by a Bonferroni test on the studentized residuals of the model.

```{r}

#leverage
lev<-hatvalues(nonlin_mod3)
plot(lev)
p<-sum(lev)
n<-length(lev)
soglia=3*p/n
abline(h=soglia,col=2)
title(main = "Leverage values of model mod1")
leverages=which(lev>soglia)
cat("There are", length(leverages), "leverages\n")

#outliers
plot(rstudent(nonlin_mod3))
abline(h=c(-2,2))
title(main = "Studentized Residuals of model mod1")


cat("The outliers are:\n")
car::outlierTest(nonlin_mod3)


#distanza di cook
cook<-cooks.distance(nonlin_mod3)
plot(cook,ylim = c(0,1)) 
title(main = "Cook distance of residuals of model mod1")
```

We note that the most influential outlier, the datapoint 1549, with a Cook distance around 0.2, could have a relevant effect on the estimate of the coefficients of the model but not drastical. By investigating this datapoint we find that this is the same datapoint that we have already met before which has a length much lower than expected, probably for a typo in its transposition.


```{r}
kable(t(dataset[1549,]), caption="Outlier datapoint") %>%
  kable_styling()

```


Next we evaluate the quality of the model in terms of the $R^2$ and RMSE metrics.
```{r}
# R^2 :
summary_mod <- summary(nonlin_mod3)
R2 <- summary_mod$r.squared
cat("The R^2 value of the model is equal to", R2, ", so the model is explaining about 73% of the variance of the variable Weight. \n")

# RMSE:
predictions <- predict(nonlin_mod3, newdata = dataset)
residuals <- dataset$Peso - predictions
RMSE <- sqrt(mean(residuals^2))
perRMSE=RMSE/mean(dataset$Peso)
cat("The RMSE value of the model is equal to", RMSE, ".\n") 
cat("This means that the model has a mean error of prediction of" , paste0(round(perRMSE,2),"%"),"with respect to the mean of the variable weight.\n")
cat("Moreover this means that the variance of the prediction errors of the model is about", round(RMSE/sd(dataset$Peso),2), "of the variance of the variable weight")
```
## Using the model to make predictions
Assume to have a female baby that will be delivered at the 39th week and whose mother has already delivered two children. Then we can predict her weight. 
For the new baby we assume a length and cranial circumference equal to the mean weight and cranial circumferene of female newborns at the 39th gestation week.


```{r}
length= mean(subset(dataset, Sesso == "F" & Gestazione==39)$Lunghezza)
cranium= mean(subset(dataset, Sesso == "F" & Gestazione==39)$Cranio)

new_data <- data.frame("N.gravidanze" = c(2),
                       "Gestazione"=c(39),
                       "Lunghezza"=c(length),
                       "Cranio"=c(cranium),
                       "Sesso"=c("F")
                       )

new_data$Lunghezza2xCranio = (new_data$Lunghezza)^2*(new_data$Cranio)
new_data$Cranio2 <- (new_data$Cranio)^2
new_data$N.gravidanze_CL <- cut(new_data$N.gravidanze, 
                     breaks=c(0,1,100), 
                     right=FALSE, 
                     labels=c("0","1+"))


weight_prediction <- round(predict(nonlin_mod3, newdata = new_data))


cat("The regression model predicts a weight equal to", paste0(weight_prediction,"gr"), ".\n") 
```


## Finally we display the relations captured in the model.

<!-- ```{r} -->

<!-- ggplot(data=dataset_clean, -->
<!--        aes(x = Lunghezza,  -->
<!--            y = Cranio,  -->
<!--            color = Peso)) + -->
<!--        geom_point(size = 3.5) + -->
<!--        labs(title = "Weight as a function of Length and Cranial circumference", -->
<!--             x = "Length", -->
<!--             y = "Cranial circumference",  -->
<!--             color = "Weight")+ -->
<!--       theme_classic()+ -->
<!--       theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)), -->
<!--             axis.title.y = element_text(size = 14, margin = margin(r = 10))) -->

<!-- ``` -->
 

First we display how the weight is distributed according to the sex and number of deliveries with respect to the quantitative variables in the model, i.e. $gestation\_time$ and $(length)^2\times cranial\_circumference$.


```{r, fig.width=11.5, fig.height=5}

plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=Sesso),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=Sesso), se=F, method = "lm")+
                   scale_color_manual(values= sex_colors, 
                                   labels = c("M"= "Male", "F"="Female"))+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.title = element_text(size = 14),
                          legend.text = element_text(size = 14))


legend <- as_ggplot(get_legend(plot_fin_1))

plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=Sesso),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=Sesso), se=F, method = "lm")+
                   scale_color_manual(values= sex_colors, 
                                   labels = c("M"= "Male", "F"="Female"))+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")


 plot_fin_3 = ggplot(data=dataset)+
                    geom_point(aes(x=Lunghezza2xCranio,
                                 y=Peso,
                                 col=Sesso))+
                    geom_smooth(aes(x=Lunghezza2xCranio,
                                    y=Peso,
                                    col=Sesso), se=F, method = "lm")+
                   scale_color_manual(values= sex_colors, 
                                   labels = c("M"= "Male", "F"="Female"))+
                    labs(title = "Weight as a function of \n Lengt^2 x cranial circumference",
                        x = "Lengt^2 x cranial circumference (mm^3)",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")

  
  
comb_fin <- plot_fin_1 + plot_fin_3 + legend + plot_layout(ncol=3, heights = c(5), widths= c(5,5,1.5))

print(comb_fin)
```



```{r, fig.width=11.5, fig.height=5}

plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=N.gravidanze_CL),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=N.gravidanze_CL), se=F, method = "lm")+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)",
                        color = "N° previous deliveries")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.title = element_text(size = 14),
                          legend.text = element_text(size = 14))


legend <- as_ggplot(get_legend(plot_fin_1))

plot_fin_1 = ggplot(data=dataset)+
                    geom_point(aes(x=Gestazione,
                                 y=Peso,
                                 col=N.gravidanze_CL),position = "jitter")+
                    geom_smooth(aes(x=Gestazione,
                                    y=Peso,
                                    col=N.gravidanze_CL), se=F, method = "lm")+
                   labs(title = "Weight as a function of Gestation time",
                        x = "gestation time",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")

  
 plot_fin_3 = ggplot(data=dataset)+
                    geom_point(aes(x=Lunghezza2xCranio,
                                 y=Peso,
                                 col=N.gravidanze_CL))+
                    geom_smooth(aes(x=Lunghezza2xCranio,
                                    y=Peso,
                                    col=N.gravidanze_CL), se=F, method = "lm")+
                    labs(title = "Weight as a function of \n Lengt^2 x cranial circumference",
                        x = "Lengt^2 x cranial circumference (mm^3)",
                        y = "weight (gr)")+
                    theme_classic()+
                    theme(axis.title.x = element_text(size = 14, margin = margin(t = 10)),
                          axis.title.y = element_text(size = 14, margin = margin(r = 10)),
                          legend.position = "none")

  
  
comb_fin <- plot_fin_1 + plot_fin_3 + legend + plot_layout(ncol=3, heights = c(5), widths= c(5,5,1.5))

print(comb_fin)
```

The plots confirm that the weight increases while increasing all the other variables in the model. Moreover we observe that the weight observed in males is always higher than the weight observed in females. Differently the difference in weight between first children and not first children is only observable when varying the variable gestation time. In this case we observe that the weight of first children is slightly smaller than the weight of not first children. The observations are in agreements with the discussion about the $\beta$ coefficients of the model. 

