library(modelsummary)
library(patchwork)
library(dplyr) 
library(ggplot2)
library(moments)
library(lmtest)
library(lubridate)
library(knitr)
library(gghalves)
library(RColorBrewer)
library(MASS)
library(car)
library(synthpop)




df = read.csv("neonati.csv", stringsAsFactors = TRUE)

df$Fumatrici <- factor(df$Fumatrici, 
                        levels = c(0, 1),
                        labels = c("Not a Smoker", "Smoker"))

First Visualization

kable(summary(df), caption = "Summary of the dataset")

Summary of the dataset
Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
Min. : 0.00	Min. : 0.0000	Not a Smoker:2396	Min. :25.00	Min. : 830	Min. :310.0	Min. :235	Ces: 728	osp1:816	F:1256
1st Qu.:25.00	1st Qu.: 0.0000	Smoker : 104	1st Qu.:38.00	1st Qu.:2990	1st Qu.:480.0	1st Qu.:330	Nat:1772	osp2:849	M:1244
Median :28.00	Median : 1.0000	NA	Median :39.00	Median :3300	Median :500.0	Median :340	NA	osp3:835	NA
Mean :28.16	Mean : 0.9812	NA	Mean :38.98	Mean :3284	Mean :494.7	Mean :340	NA	NA	NA
3rd Qu.:32.00	3rd Qu.: 1.0000	NA	3rd Qu.:40.00	3rd Qu.:3620	3rd Qu.:510.0	3rd Qu.:350	NA	NA	NA
Max. :46.00	Max. :12.0000	NA	Max. :43.00	Max. :4930	Max. :565.0	Max. :390	NA	NA	NA

Age of the mother smaller than 12-14 is probably errors in the dataset, so here the cleaned dataset. Even if 12-14 is still a very young age it happens, so it’s going to be keep in the dataset.

df <- df[df$Anni.madre > 12, ]
attach(df)

kable(summary(df), caption = "Summary of the dataset")

Summary of the dataset
Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
Min. :13.00	Min. : 0.0000	Not a Smoker:2394	Min. :25.00	Min. : 830	Min. :310.0	Min. :235	Ces: 728	osp1:816	F:1255
1st Qu.:25.00	1st Qu.: 0.0000	Smoker : 104	1st Qu.:38.00	1st Qu.:2990	1st Qu.:480.0	1st Qu.:330	Nat:1770	osp2:848	M:1243
Median :28.00	Median : 1.0000	NA	Median :39.00	Median :3300	Median :500.0	Median :340	NA	osp3:834	NA
Mean :28.19	Mean : 0.9816	NA	Mean :38.98	Mean :3284	Mean :494.7	Mean :340	NA	NA	NA
3rd Qu.:32.00	3rd Qu.: 1.0000	NA	3rd Qu.:40.00	3rd Qu.:3620	3rd Qu.:510.0	3rd Qu.:350	NA	NA	NA
Max. :46.00	Max. :12.0000	NA	Max. :43.00	Max. :4930	Max. :565.0	Max. :390	NA	NA	NA

1. Preliminary Analysis

A. Variable:

Explanatory variable:

“Anni.Madre”: age of the mother; it’s a discrete quantitative variable.
“N.gravidanze”: number of pregnancy of the mother; it’s a discrete quantitative variable.
“Fumatrici”: binary variable (dummy variable); 1 = mother is a smoker - 0 = mother is not a smoker.
“Gestazione”: number of week of the pregnancy; it’s a discrete quantitative variable.
“Lunghezza”: newborn’s length in mm; it’s a continous quantitative variable.
“Cranio”: newborn’s head diameter in mm; it’s a continous quantitative variable.
“Tipo.parto”: type of delivery, cesarean or natural; it’s a qualitative variable, with two class.
“Ospedale”: Identification of the hospital; it’s a qualitative variable with three class.
“Sesso”: newborn’s gender; it’s a qualitative variable with two class.

The qualitative variable are converted to factor, so that can be used as class during the analysis.

Target Variable:

“Peso”: newborn’s weight in grams;it’s a continous quantitative variable.

B. Distribution and Box Plot:

Quantitative Variable

Continous

gini_index_func = function(col){
  n_i = table(col)
  f_i_2 = (table(col)/length(col)) ^ 2
  j = length(table(col))
  gini = 1 - sum(f_i_2)
  gini_norm = gini/((j - 1)/j)
  return(gini_norm)}


density_form_con = function(column, label_name, y_cord, y_cord_2){
  kurt = round(kurtosis({{column}}) - 3, 1)
  fis = round(skewness({{column}}), 1)
  
  string_1 = paste("Kurtosis Index = ", kurt)
  
  string_2 = paste("Fischer Index = ", fis)
  
  plot_title = paste0("Newborn's ", label_name, " Distribution" )
  
  
  density_value = density({{column}})
  
  
  g1 = ggplot(data = df)+
    geom_half_violin(aes(y = {{column}}),side = "left", col = "#5A7FA8", linewidth =
                       0.7)+
    geom_half_boxplot(aes(y = {{column}}), fill = "#5A7FA8")+
    labs(title = plot_title, x = label_name, y = "")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))+
    annotate("label", label = string_1, x = 0.7, y = y_cord, fill = "white", vjust = 1,
             hjust = 1)+
    annotate("label", label = string_2, x = 0.7, y = y_cord_2, 
             fill = "white",
             vjust = 1, hjust = 1)
  
  return(g1)
}

plt_cra = density_form_con(Cranio, "Cranium diameter", 400, 388)

plt_cra

The newborn’s Cranium diameter has an high peak; the shape is capture by the Kurtosis index that is greater than 0. The Fischer index suggest correctly the asymmetry of the newborn’s Cranium diameter; with a negative asymmetry. From the boxplot is also possible to see various outliers.

plt_len = density_form_con(Lunghezza, "Length (mm)", 560, 542)

plt_len

The newborn’s length has a more elongated shape, with an higher peak. This shape it’s also capture by the Kurtosis index that is greater than 0. The same logic of before can be applied for the Fischer index that capture the negative asymmetry of the distribution. Similar to the Cranium size the boxplot has various outlier.

plt_wei = density_form_con(Peso, "weight", 5000, 4720)

plt_wei

The target variable has a kurtosis index equal to 2, that means a more elongated shape and an higher peak in comparison with the normal distribution. Fischer index is slightly negative, so the shape is similar to a normal distribution with a small negative asymmetry.

shap = shapiro.test(Peso)
kable(paste("The target variable does not have a normal distribution, this can be seen from the p-value of the shapiro test, because is smaller than 0.05 and so the null hypothesis is rejected: p-value = ", round(shap$p.value,3)), col.names = "")

The target variable does not have a normal distribution, this can be seen from the p-value of the shapiro test, because is smaller than 0.05 and so the null hypothesis is rejected: p-value = 0

Discrete:

density_form_dis = function(column, label_name, y_cord){
  kurt = round(kurtosis({{column}}) - 3, 1)
  fis = round(skewness({{column}}), 1)
  

  string_1 = paste("Kurtosis Index = ", kurt)

  string_2 = paste("Fischer Index = ", fis)

  plot_title = paste0(label_name, " Distribution" )


  g1 = ggplot(data = df)+
    geom_boxplot(aes(y = {{column}}), fill = "#5A7FA8")+
    labs(title = plot_title, y = label_name, x = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))+
    annotate("label", label = string_1, x = 0.7, y = 45, fill = "white", vjust = 1,
             hjust = 1)+
   annotate("label", label = string_2, x = 0.7, y = y_cord, fill = "white", vjust = 1,
           hjust = 1)

  g2 = ggplot(data = df)+
    geom_col(aes(x = {{column}}),stat = "count",col = "black" , fill = "#5A7FA8",
             linewidth = 0.7)+
    labs(title = plot_title, x = label_name, y = "")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))
  return(g1 | g2)
}

plt_ges = density_form_dis(Gestazione, "Month of Pregnancy", 43.5)


plt_ges

The month of pregnancy has the same asymmetry and shape as “Lughezza” and “Cranio”, this behavior is very clear from the two index evaluate. From the boxplot is also possible to see some outliers especially in the lower range of the variable.

plt_age = density_form_dis(Anni.madre, "Mother Age", 41.5)


plt_age

Instead the age distribution has a shape equal to the normal distribution. The bell shape is reflected also in the index that are very close to 0.

ggplot()+
  geom_col(aes(x = N.gravidanze), stat = "count", col = "black", fill = "#5A7FA8")+
  labs(title = "# Pregnancy Distribution", x = "# Pregnancy", y = "Count", caption = "Caption: Visualization of the frequency distribution for the number of pregnancy")+
  theme_classic()+
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.45,vjust = -1,face = "bold",size = 10, colour = "black" ))

Qualitative Variable

Below are plotted the frequency for all the qualitative variable. It’s also evaluated the Gini Index, where an Index near 1 means that the variable is evenly distributed across the classes; instead a Gini index equal to zero is related to a variable that has one dominant class.

viz_qual_func = function(x, col_label, y_cord,x_cord, h_cord, v_cord, size_val){
  
  
  gini_string = paste("Gini Index = ", round(gini_index_func({{x}}) , 1))
  title_string = paste("Frequency Distribution: ", col_label)
  
  g1 = ggplot()+
    geom_col(aes( x = {{x}}), stat = "count", fill = "#5A7FA8", col = "black")+
    labs( title = title_string, x = col_label, y = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))+
     annotate("label", label = gini_string, 
              x = x_cord, y = y_cord,
              vjust = v_cord,
              hjust = h_cord, 
              alpha = 0.3, size = size_val)
  return(g1)
}


plot_fum = viz_qual_func(Fumatrici, "Smoker", Inf, Inf, 1.1, 1, 4)
plot_fum

plot_del = viz_qual_func(Tipo.parto, "Cesarean or Natural", Inf, -Inf, -0.1, 1, 4)
plot_del

plot_ho = viz_qual_func(Ospedale, "Hospital", Inf, Inf, 1, 1, 3.2)
plot_gen = viz_qual_func(Sesso, "Male or Female", Inf, Inf, 1.1, 1, 3.2)

plot_ho | plot_gen

The dataset is evenly distributed between the Hospitals and the newborn’s gender; also the mode of delivery is evenly distributed with a Gini index of 0.8 and a slight predominance of the Natural delivery. The First figure tells us that the mother are almost everytime non smoker in this dataset.

C. Hypothesis test:

H_{0}: In some hospitals, more cesarean sections are performed

Let’s first visualize the frequency table observed between the three hospitals.

ggplot()+
  geom_col(aes(x = Tipo.parto, fill = Ospedale), stat = "count", position = "dodge")+
  labs(title = "Mode of delivery by hospital", x = "Mode of delivery", y = "Count")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))

To reject or not the null hypothesis a chi-squared test can be perform.

chi = chisq.test(table(Ospedale, Tipo.parto))

kable(paste("Chi-Squared test p-value = ", round(chi$p.value, 2) , ". So the null hypothesis is rejected and from the dataset the number of cesarean section is indipendent from the hospital"), col.names = "")

Chi-Squared test p-value = 0.58 . So the null hypothesis is rejected and from the dataset the number of cesarean section is indipendent from the hospital

The average weight and length of this dataset is statistical representative of the population

WEIGHT:

The known average weight of the newborn’s is 3300g.

test_data =t.test(Peso, mu = 3300, conf.level = 0.95)


kable(paste("Taken the population mean for the newborn's weight as 3300g a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to ", round(test_data$p.value, 3), ". So the null hypothesis is rejected and the mean is not equal to the population mean."), col.names = "")

Taken the population mean for the newborn’s weight as 3300g a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to 0.132 . So the null hypothesis is rejected and the mean is not equal to the population mean.

kable(paste("However the population mean is within the confidence level of [", round(test_data$conf.int[1],2),round(test_data$conf.int[2], 2), "]. So the difference can be attributed to the data selection."), col.names = "")

However the population mean is within the confidence level of [ 3263.58 3304.79 ]. So the difference can be attributed to the data selection.

kable("Considering the confindence level analysis the newborn's weight of the dataset is significantly equal to the population mean.", col.names = "")

Considering the confindence level analysis the newborn’s weight of the dataset is significantly equal to the population mean.

LENGHT:

The known average weight of the newborn’s is 500mm.

test_data =t.test(Lunghezza, mu = 500, conf.level = 0.95)


kable(paste("Taken the population mean for the newborn's length as 500mm a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to ", round(test_data$p.value, 3), ". So the null hypothesis is rejected and the mean is not equal to the population mean."), col.names = "")

Taken the population mean for the newborn’s length as 500mm a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to 0 . So the null hypothesis is rejected and the mean is not equal to the population mean.

kable(paste("The population mean is also outside the confidence interval of [", round(test_data$conf.int[1],2),round(test_data$conf.int[2], 2), "]. So the difference can not be attributed to the data selection."), col.names = "")

The population mean is also outside the confidence interval of [ 493.66 495.73 ]. So the difference can not be attributed to the data selection.

kable("The newborn's length of the dataset is significantly different to the population mean.", col.names = "")

The newborn’s length of the dataset is significantly different to the population mean.

The average of weight,length and cranium diameter statistically different between the gender

To answer a t-test is performed, with a confidence level of 95% and considering both side of the distribution. The t-test is performed considering the various measurements in relation with the gender.

test_data = t.test(data = df, Peso ~ Sesso)

ggplot()+
  geom_boxplot(aes(y = Peso, x = Sesso, fill = Sesso))+
  labs(title = "Weight Distribution by gender", x = "Gender", y = "Weight")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)), 
              x = 2.3, y = 1000,
              alpha = 0.3)+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)), 
              x = 0.7, y = 1000,
              alpha = 0.3)

kable(paste("p-value = ", round(test_data$p.value,3), ". So the null hypotesis is rejected and the weight mean between the two gender is statistically different."), col.names = "")

p-value = 0 . So the null hypotesis is rejected and the weight mean between the two gender is statistically different.

test_data = t.test(data = df, Lunghezza ~ Sesso)

ggplot()+
  geom_boxplot(aes(y = Lunghezza, x = Sesso, fill = Sesso))+
  labs(title = "Length Distribution by gender", x = "Gender", y = "Weight")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)), 
              x = 2.3, y = 300,
              alpha = 0.3)+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)), 
              x = 0.7, y = 300,
              alpha = 0.3)

kable(paste("p-value = ", test_data$p.value, ". So the null hypotesis is rejected and the length mean between the two gender is statistically different."), col.names = "")

p-value = 2.23232824860459e-21 . So the null hypotesis is rejected and the length mean between the two gender is statistically different.

test_data = t.test(data = df, Cranio ~ Sesso)

ggplot()+
  geom_boxplot(aes(y = Cranio, x = Sesso, fill = Sesso))+
  labs(title = "Cranium size Distribution by gender", x = "Gender", y = "Weight")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)), 
              x = 2.3, y = 240,
              alpha = 0.3)+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)), 
              x = 0.7, y = 240,
              alpha = 0.3)

kable(paste("p-value = ", test_data$p.value, ". So the null hypotesis is rejected and the Cranium diameter mean between the two gender is statistically different."), col.names = "")

p-value = 1.41401856590502e-13 . So the null hypotesis is rejected and the Cranium diameter mean between the two gender is statistically different.

2. Creation of the Regression Model

model_1 = lm(Peso ~., data = df)
summary(model_1)

## 
## Call:
## lm(formula = Peso ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1123.26  -181.53   -14.45   161.05  2611.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6735.7960   141.4790 -47.610  < 2e-16 ***
## Anni.madre          0.8018     1.1467   0.699   0.4845    
## N.gravidanze       11.3812     4.6686   2.438   0.0148 *  
## FumatriciSmoker   -30.2741    27.5492  -1.099   0.2719    
## Gestazione         32.5773     3.8208   8.526  < 2e-16 ***
## Lunghezza          10.2922     0.3009  34.207  < 2e-16 ***
## Cranio             10.4722     0.4263  24.567  < 2e-16 ***
## Tipo.partoNat      29.6335    12.0905   2.451   0.0143 *  
## Ospedaleosp2      -11.0912    13.4471  -0.825   0.4096    
## Ospedaleosp3       28.2495    13.5054   2.092   0.0366 *  
## SessoM             77.5723    11.1865   6.934 5.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 668.7 on 10 and 2487 DF,  p-value: < 2.2e-16

Non-Categorical Variable:

Anni.madre: p-value is above 0.05, so this variable does not have a significant influence on the newborn’s weight. Also the related beta coefficient is small considering that weight is in the thousand scale and a change in the mother’s age influence the weight of only 0.89. Everything confirm that the null hypothesis cannot be rejected and the coefficient is null;
N.gravidanze: p-value is below the threshold with a value of 0.02 and the coefficient is 11.3, so this means that the number of deliver is statistically important in the newborn’s weight;
Gestazione: as expected the p-value is almost null and beta is equal to 32.6. This confirm that a longer pregnancy leads to a “heavier” baby;
Lunghezza - Cranio: this two variable have a statistically significance in the newborn’s weight. This is very simple to imagine like for adults: is very likely that one person 2.00m tall is going to be heavier of one 1.50m tall. Null hypothesis rejected for both. Also is possible to see that both have a very similar coefficient meaning that they have a similar influence to the weight. Further in this paper is going to be investigated if this two variable influence each other and if the model will need just one of the two or both.

Categorical Variable:

For categorical variable the summary needs to be read slightly different. The coefficient means that taken as constant the other variable a change of class define a change of the output variable equal to the coefficient.

Fumatrici (Non-Smoker -> Smoker): smoking is known in general as bad for the newborns and the pregnancy itself. This can be seen in the coefficient equal to -30.2, that means in general a non-smoker will have a 30g heavier newborn. Very similar to the impact of the duration of the pregnancy;
Tipo.parto (cesarean section -> Natural): changing from a cesarean section and a natural delivered leads to a weight’s increase. This can be understood by thinking that the cesarean section is always an emergency situation due to needs or something wrong with the pregnancy and is always done weeks before the natural due date. So this can be seen as a consequence of emergency and duration of the pregnancy itself.
Ospedaleosp2 (Other osp -> osp2): between different hospital the only reason for different coeffiencient is that the second hospital is the one with the most cesarean section, in fact the coefficient is negative;
Ospedaleosp3 (Other osp -> osp3): instead the osp3 is the one with the most natual deliver and the coeffienct is positive;
Sesso (F -> M): as seen before the gender has a big influence on the weight and the beta value confirm that. So this variable is the one with the most impact on the newborn’s baby.

2 Searching the best Regression Model

The starting point of the saturated model (model_1) is:

AIC = 35171
BIC = 35241
R\(^{2}\) ~ 0.73

2.1 Stepwiese method

Correllation Matrix

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(df, upper.panel = panel.smooth, lower.panel = panel.cor)

Above there is a plot of the correlation Matrix where is possible to see the correlation between the quantitative variable.

As expected the three physical measurement are highly correlated, with length and cranium size that are related with a coefficient of 0.60 and are respectively correlated with the weight with a coefficient of 0.80 and 0.70. The correlation matrix confirm the relation between “Gestazione” and the physical measurements, with coefficient 0.59 with weight, 0.62 with length, 0.46 with Cranium size; this aligns with what it’s known about pregnancy. From the chart it’s also possible to see that the relation between the months of the pregnancy and the physical measurements starts with a linear behavior and has a sort of plateau reached a certain body size.

The high correlation between cranium size and length suggest that it can be interaction within the structure of the model, as well as length and “Gestazione”.

Searching the right model

modelcomparison = function(model_A, model_B){
  dataanova = anova(model_A, model_B)
  pvalue = dataanova$`Pr(>F)`[2]
  
  aic_a = AIC(model_A)
  bic_a = BIC(model_A)

  aic_b = AIC(model_B)
  bic_b = BIC(model_B)
  
  summ_a = summary(model_A)
  summ_b = summary(model_B)
  r_a = round(summ_a$r.squared, 3)
  r_b = round(summ_b$r.squared, 3)
  
  mse_model_a = mean(model_A$residuals^2)
  rmse_model_a = sqrt(mse_model_a)
  
  mse_model_b = mean(model_B$residuals^2)
  rmse_model_b = sqrt(mse_model_b)
  
  model <- c("Starting Model", "New Model")
  AIC <- c(aic_a, aic_b)
  BIC <- c(bic_a, bic_b)
  R <- c(r_a, r_b) 
  RMSE_value <- c(rmse_model_a, rmse_model_b)
  anovatest <- c("", round(pvalue, 3))
  
  tablecomparison <- data.frame(Models = model, AIC_value = AIC, BIC_value = BIC, R_squared = R, RMSE =  RMSE_value, ANOVA_pvalue = anovatest)
  
  return(tablecomparison)
}

I. There should be no difference between different hospital.

model_2 = update(model_1, ~. - Ospedale)
table12 = modelcomparison(model_1, model_2)
kable(table12)

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Starting Model	35145.57	35215.45	0.729	273.4174
New Model	35150.75	35208.98	0.728	273.9203	0.01

The anova test suggest that without the hospital variable the model are statistically different to the first model, with a p-value around 0.01 that is lower than 0.05. From the table above the BIC coefficient for model one (saturated model) is slightly lower to the new model’s coefficient, on the other side the AIC coefficient tells the opposite story. This suggests, with the R\(^{2}\) value that is the same between the two model, that the two model explain the same amount of information about the newborn’s weight, but the AIC parameter prefer over-parametrize model and BIC prefer simpler model. In this case simpler model is better because can lead to identify better what really influence the weight considering that there should be no difference between hospitals.

1. Years of the mother have a low impact on the weight:

model_3 = update(model_2, ~. - Anni.madre)
table23 = modelcomparison(model_2, model_3)
kable(table23)

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Starting Model	35150.75	35208.98	0.728	273.9203
New Model	35149.33	35201.73	0.728	273.9518	0.45

The test confirm the behavior suggested by the summary of the saturated model. - pvalue of the anova test above the threshold of 0.05, so the two model are statistically equal. - BIC and AIC coefficient are very close, so the information explained by the two model remain equal, with a slight improvement in the coefficients of the new model especially in the BIC coefficient.

1. Studying the interaction between cranium size and length:

model_4 = update(model_3, ~. - Cranio)
table34 = modelcomparison(model_3, model_4)
kable(table34)

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Starting Model	35149.33	35201.73	0.728	273.9518
New Model	35692.26	35738.84	0.661	305.5233	0

Without the Cranium size variable R\(^{2}\) goes from 0.73 to 0.66, so the percentage of information explained by the new model decrease significantly; this is confirmed by the pvalue of the anova test that is below the threshold. BIC and AIC coeffienct confirm that the starting model is better.

model_5 = update(model_3, ~. -Cranio - Lunghezza + Cranio * Lunghezza)
table35 = modelcomparison(model_3, model_5)
kable(table35)

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Starting Model	35149.33	35201.73	0.728	273.9518
New Model	35128.71	35186.95	0.730	272.7146	0

Introducing an interaction factor between Cranium size and body length the anova test tells that the two model are statistically different and the BIC/AIC coefficient suggest that the model with the interaction factor explain better the newborn’s weight, with a lower BIC coefficient. In this case also the AIC coefficient is better for the new model. Meanwhile the percentage of the phenomenon explained by the model remain stable.

1. Non linearity:

Is very clear that the physical measurements is clearly one of the best way to predict the newborn’s weight. So it can be studied if the non linearity coefficient of the length and Cranio can make the model improve,

final_model = update(model_5, ~.  + I(Lunghezza**2) + I(Cranio**2))
table5f = modelcomparison(model_5, final_model)
kable(table5f)

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Starting Model	35128.71	35186.95	0.73	272.7146
New Model	35042.83	35112.71	0.74	267.8520	0

The non linear coefficient of the newborn’s length improve the model with a general imporvement of R\(^{2}\), AIC, BIC coefficients and RMSE.

3 Model Quality Analysis

WINNING MODEL

kable(final_model$coefficients, col.names = "Coefficient values")

	Coefficient values
(Intercept)	-770.1966995
N.gravidanze	14.5143328
FumatriciSmoker	-23.5146476
Gestazione	41.1071786
Tipo.partoNat	28.4210553
SessoM	71.8929637
Cranio	4.2662118
Lunghezza	-11.7036094
I(Lunghezza^2)	0.0438045
I(Cranio^2)	0.0524431
Cranio:Lunghezza	-0.0599864

R\(^{2}\) = 0.74
RMSE = 268.85
AIC = 35085
BIC = 35143

Multicollinearity

vif_table = vif(final_model, type = "predictor")

kable(vif_table, caption = "Multicollinearity coefficients")

Multicollinearity coefficients
	GVIF	Df	GVIF^(1/(2*Df))	Interacts With	Other Predictors
N.gravidanze	1.028617	1	1.014208	–	Fumatrici, Gestazione, Tipo.parto, Sesso, Cranio, Lunghezza
Fumatrici	1.007964	1	1.003974	–	N.gravidanze, Gestazione, Tipo.parto, Sesso, Cranio, Lunghezza
Gestazione	1.853892	1	1.361577	–	N.gravidanze, Fumatrici, Tipo.parto, Sesso, Cranio, Lunghezza
Tipo.parto	1.004512	1	1.002253	–	N.gravidanze, Fumatrici, Gestazione, Sesso, Cranio, Lunghezza
Sesso	1.048700	1	1.024061	–	N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Cranio, Lunghezza
Cranio	1.917242	5	1.067254	I(Cranio^2), Lunghezza	N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Sesso
Lunghezza	1.917242	5	1.067254	I(Lunghezza^2), Cranio	N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Sesso

The multicollinearity coefficients are under 5, so the model assumption on multicollinearity is respected.

General Overview

plot(final_model)

A. Residuals vs Fitted

The residuals plot is consistent with a random distribution around zero

B. Q-Q Plot

The points should be around the bisector, and the results have some problems at the extreme with the model that is not consistent at the extreme part, probably due to some outliers

C. Scale-Location

The points are distributied randomly around a constant between 1 and 0.5.

D. Residuals vs. Leverage

Here is possible to see that 3 values are outliers. Where 310 and 1429 are close to the 0.5 limit, so they shouldn’t be a problem for the model; but 1551 is outside of the threshold of 1, so can have an influence on the model.

Here the outliers:

kable(df[1549, ])

	Anni.madre	N.gravidanze	Fumatrici	Gestazione	Peso	Lunghezza	Cranio	Tipo.parto	Ospedale	Sesso
1551	35	1	Not a Smoker	38	4370	315	374	Nat	osp3	F

It seems a normal datapoint, where a non smoker mother was able to do a Natural delivery at the end of the pregnancy itself. The weight is above way bigger than the female average of 3161 and this can be the reason. Some pregnancy that lead to a newborn’s with an high weight exists and can also be due to different reason as well as diseases, so it’s correct to have it in the dataset even if can influence the model.

Residuals

ggplot()+
  geom_density(aes(residuals(final_model)))+
  labs(title = "Final Model: Density plot of the Residuals", y = "Density", x = "Residuals")+
  theme_minimal()+theme(plot.title = element_text(hjust = 0.5))

shap = shapiro.test(residuals(final_model))
bp = bptest(final_model)
dw = dwtest(final_model)

Tests <- c("Shapiro Test", "Breusch-Pagan Test", "Durbin-Watson Test")
Pvalue <- c(round(shap$p.value, 3), round(bp$p.value, 3), round(dw$p.value, 3))

recap_dataframe <- data.frame(Test = Tests, pvalue = Pvalue)
kable(recap_dataframe)

Test	pvalue
Shapiro Test	0.000
Breusch-Pagan Test	0.000
Durbin-Watson Test	0.085

From the result above it’s easy to understand that there are some problems with the model.

Shapiro Test: the residuals does not respect the normality
Breusch-Pagan Test: the omoschelasticity is not respected
Durbin-Watson Test: autocorrelation of the disturbances is consistent through the datapoints with a p-value above the threshold of 0.05

Outliers

A brief study of the outliers can explain what is happening.

lev = hatvalues(final_model)
p = sum(lev)
n = nrow(df)
Threshold = 2* p/n

plot(lev, main = "Leverage values", ylab = "Leverage values", xlab = "Index", pch = 1) 
abline(h = Threshold, col = "red", lwd = 1, lty = 2)
text(x = 2200,  
     y = 0.58,  
     labels = paste("Number of outliers =", length(lev[lev > Threshold])),
     col = "black", font = 2, cex = 0.8)
legend(
  "topleft",                    
  legend = c("Leverage", "Threshold"),
  col = c("black", "red"),          
  pch = c(1, NA),                
  lty = c(NA, 2),                  
  lwd = c(NA, 1),                  
  bty = "n")

From the leverage plot, can already be seen that there are more than 200 outliers.

plot(rstudent(final_model), main = "Residuals rinormalized", ylab = "Studentize residuals")
abline(h=c(-2,2), col = "red")
legend(
  "topleft",                    
  legend = c("Residuals", "Threshold"),
  col = c("black", "red"),          
  pch = c(1, NA),                
  lwd = c(NA, 1),                  
  bty = "n")

Also through the tstudent ri-normalization it clear that the residuals have outliers. So the outliers can be studied further.

Deep Dive on the smoking variable

detach(df)
outliers = which(lev < Threshold)

df["Leverage value"] = lev


ggplot(data = df)+
  geom_point(aes(x = Fumatrici, y = `Leverage value`))+
  labs( title = "Leverage values by Smoker habits", x = "Smoker", y = "Leverage Value")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))

Smoking create almost every time an “out of standard” residuals of the newborn’s weight, enough to influence the regression model to be unpredictable.

df_outliers = df[-outliers, ]

ggplot(data = df_outliers)+
  geom_col(aes(x = Fumatrici), stat = "Count", fill = "5A7FA8", col = "black")+
  labs( title = "Count of data with Leverage value above threshold", x = "Smoking", y = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))

notoutliers = which(lev > Threshold)
df_clean = df[-notoutliers,]


ggplot(data = df_clean)+
  geom_col(aes(x = Fumatrici), stat = "Count", fill = "5A7FA8", col = "black")+
  labs( title = "Count of data with Leverage value below threshold", x = "Smoking", y = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))

This last two column plots confirm that a smoker is always out of standard, for this dataset at least, influencing the regression model.

Let’s see if a model with only non smoker datapoints reflect better the requirment for a regression model.

df_non_smoker <- df[df$Fumatrici == "Not a Smoker", ]
df_non_smoker$Fumatrici = NULL
saturated_model_nonsmok = lm(Peso ~. - `Leverage value`, data = df_non_smoker)
summary(saturated_model_nonsmok)

## 
## Call:
## lm(formula = Peso ~ . - `Leverage value`, data = df_non_smoker)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.79  -181.67   -14.27   162.97  2577.80 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6736.1069   143.3650 -46.986  < 2e-16 ***
## Anni.madre        0.7243     1.1739   0.617  0.53728    
## N.gravidanze     12.5362     4.7702   2.628  0.00864 ** 
## Gestazione       34.5836     3.8885   8.894  < 2e-16 ***
## Lunghezza        10.1097     0.3062  33.018  < 2e-16 ***
## Cranio           10.4955     0.4340  24.183  < 2e-16 ***
## Tipo.partoNat    33.1134    12.3394   2.684  0.00733 ** 
## Ospedaleosp2    -10.0486    13.7665  -0.730  0.46551    
## Ospedaleosp3     33.2505    13.8054   2.409  0.01609 *  
## SessoM           79.7595    11.4224   6.983 3.74e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.3 on 2384 degrees of freedom
## Multiple R-squared:  0.7302, Adjusted R-squared:  0.7292 
## F-statistic:   717 on 9 and 2384 DF,  p-value: < 2.2e-16

The r-squared is similar to the previous one, with a value of 0.73.

To choose the final model for non smokers is taken in consideration:

Between different hospital should be the same
Natural delivered or cesareum is often related to a risk in dipendence of the year of the mother and pregnancy duration (“Gestazione”)
Quadratic component for the physical measurement that are one of the most important factor

attach(df_non_smoker)

modelcomparison_2 = function(model_A, model_B){
  dataanova = anova(model_A, model_B)
  pvalue = dataanova$`Pr(>F)`[2]
  
  aic_a = AIC(model_A)
  bic_a = BIC(model_A)

  aic_b = AIC(model_B)
  bic_b = BIC(model_B)
  
  summ_a = summary(model_A)
  summ_b = summary(model_B)
  r_a = round(summ_a$r.squared, 3)
  r_b = round(summ_b$r.squared, 3)
  
  mse_model_a = mean(model_A$residuals^2)
  rmse_model_a = sqrt(mse_model_a)
  
  mse_model_b = mean(model_B$residuals^2)
  rmse_model_b = sqrt(mse_model_b)
  
  model <- c("Saturated Model", "Final Model")
  AIC <- c(aic_a, aic_b)
  BIC <- c(bic_a, bic_b)
  R <- c(r_a, r_b) 
  RMSE_value <- c(rmse_model_a, rmse_model_b)
  anovatest <- c("", round(pvalue, 3))
  
  tablecomparison <- data.frame(Models = model, AIC_value = AIC, BIC_value = BIC, R_squared = R, RMSE =  RMSE_value, ANOVA_pvalue = anovatest)
  
  return(tablecomparison)
}




final_model_nosmok = update(saturated_model_nonsmok, ~. - Tipo.parto - Ospedale - Lunghezza - Cranio + Tipo.parto * Anni.madre + I(Lunghezza**2) + I(Cranio**2) )

kable(modelcomparison_2(saturated_model_nonsmok, final_model_nosmok))

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Saturated Model	33687.07	33750.66	0.730	273.7448
Final Model	33637.67	33695.48	0.736	271.0479	NA

The final model for the non smoker has a better BIC and AIC parameter and a slightly better R squared around 0.74.

With the same logic it’s possible to create a model for the mothers that smoke.

detach(df_non_smoker)

df_smoker <- df[df$Fumatrici == "Smoker", ]
df_smoker$Fumatrici = NULL
attach(df_smoker)

saturated_model_smok = lm(Peso ~. - `Leverage value`, data = df_smoker)

final_model_smok = update(saturated_model_smok, ~.   - Tipo.parto - Ospedale - Lunghezza - Cranio + Tipo.parto * Anni.madre + I(Lunghezza**2) + I(Cranio**2) )

detach(df_smoker)
kable(modelcomparison_2(saturated_model_smok, final_model_smok))

Models	AIC_value	BIC_value	R_squared	RMSE	ANOVA_pvalue
Saturated Model	1453.224	1482.312	0.756	235.5555
Final Model	1449.915	1476.359	0.759	234.0780	NA

Tests

shap_nonsmok = shapiro.test(residuals(final_model_nosmok))
bp_nonsmok = bptest(final_model_nosmok)
dw_nonsmok = dwtest(final_model_nosmok)

shap_smok = shapiro.test(residuals(final_model_smok))
bp_smok = bptest(final_model_smok)
dw_smok = dwtest(final_model_smok)


model <- c("Not smoker model", "Smoker Model")
Shapiro_Test <- c(round(shap_nonsmok$p.value, 3), round(shap_smok$p.value, 3))
BP_Test <- c(round(bp_nonsmok$p.value, 3), round(bp_smok$p.value, 3))
DW_Test <- c(round(dw_nonsmok$p.value, 3), round(dw_smok$p.value, 3)) 
  
tabletests <- data.frame(Models = model, Shap_test = Shapiro_Test, BP_test = BP_Test, DW_test = DW_Test)
kable(tabletests)

Models	Shap_test	BP_test	DW_test
Not smoker model	0.000	0.000	0.037
Smoker Model	0.361	0.062	0.788

Not Smoking Model:

Shapiro test has a p-value below threshold, so the null hypothesis is rejected and the residuals does not follow a normal distribution
Breusch-Pagan Test has a p-value below threshold, so the null hypothesis is rejected and indicates heteroscedasticity
Durbin-Watson Test has a p-value slightly below the threshold of 0.05 and indicates that true autocorrelation is greater than 0

plot(final_model_nosmok)

Smoking Model:

Shapiro test has a p-value above threshold, so the null hypothesis is not rejected and the residuals does follow a normal distribution
Breusch-Pagan Test has a p-value above threshold, so the null hypothesis is not rejected and indicates homoscedasticity (residuals have a constant variance)
Durbin-Watson Test has a p-value slightly above the threshold of 0.05 and indicates that there isn’t autocorrelation

plot(final_model_smok)

3 Prediction Process

General Model

col_to_keep = c("N.gravidanze", "Fumatrici", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")

new_data = df[, col_to_keep]
  
prevision = predict(final_model, newdata = new_data)
  
new_data["Peso"] = prevision

new_data["Category"] = "Prediction"

df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

With the training data the model seems stable and predict similar value to the actual model.

Now to test if the model is stable with a different dataset we can use the bootstrapping method. A new dataframe is create with samples of the row of the initial dataset so that the proportion and the underlying natural laws are conserved, but it’s still possible to see if the model can handle new dataset.

df_train = df[,col_to_keep]
new_data <- df[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]

prediction <- predict(final_model, newdata = new_data)
new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model is still stable even with the new dataset generated from the training data. With the same logic is possible to generate a new synthetic dataset that reflect the underlying natural laws, so that there isn’t possibility of impossible combination of, for example, Cranium size, lenght and newborn’s weight.

new_data <- mysynt$syn

prediction <- predict(final_model, newdata = new_data)

new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model remain stable and accurate with the synthetic dataset too.

Non Smoker Model

col_to_keep = c("N.gravidanze", "Anni.madre", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")

new_data = df_non_smoker[, col_to_keep]
  
prevision = predict(final_model_nosmok, newdata = new_data)
  
new_data["Peso"] = prevision

new_data["Category"] = "Prediction"

df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model seems stable and predict similar value to the actual dataset

Bootstrapping method.

df_train = df_non_smoker[,col_to_keep]
new_data <- df_non_smoker[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]

prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model is still stable even with the new dataset generated from the training data. As before now it’s possible to study the model with a synthetic dataset.

new_data <- mysynt$syn

prediction <- predict(final_model_nosmok, newdata = new_data)

new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model remain stable and accurate with the synthetic dataset too.

Smoker Model

col_to_keep = c("N.gravidanze", "Anni.madre", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")

new_data = df_smoker[, col_to_keep]
  
prevision = predict(final_model_smok, newdata = new_data)
  
new_data["Peso"] = prevision

new_data["Category"] = "Prediction"

df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

df_train = df_smoker[,col_to_keep]
new_data <- df_smoker[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]

prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

new_data <- mysynt$syn

prediction <- predict(final_model_smok, newdata = new_data)

new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

Similar to the previous model, the Smoker model, is stable with the three different prediction simulation.

4 Variable importance

Now it’s possible to understand how the different variable impact the linear regression model

4.1 Gender

ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = Sesso), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = Sesso), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Newborn's Measurments: Linear Regression by Gender",
       x = "Body Length * Cranium Size", y = "Weight", color = "Gender")+
    theme(plot.title = element_text(hjust = 0.5))

The gender variable determine two different regression where the male ones have higher measurments, but with same slope, so same relation between the three variable, as expected.

ggplot(data = df)+
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = Sesso), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = Sesso), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Number of pregnancy month: Linear Regression by Gender",
       x = "# Months", y = "Weight", color = "Gender")+
    theme(plot.title = element_text(hjust = 0.5))

The same behavior can be seen with the duration of the pregnancy.

Also for both charts is it possible to see that the male cluster is above the female one in general, as expected.

4.3 Mother’s Age

Another important factor for a good pregnancy is often the mother’s age.

df$MotAge <- cut(
  df$Anni.madre,
  breaks = c(0, 18, 30, max(df$Anni.madre)),
  labels = c("Minors", "18-30", "30+"),
  right = TRUE,  
  include.lowest = TRUE
)


ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = MotAge), size = 0.8, position = "jitter", alpha = 1/5)+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = MotAge), se = FALSE, method = "lm", linewidth = 0.5)+  
  theme_minimal()+
  labs(title = "Newborn's Measurments: Linear Regression by Mother's age",
       x = "Body Length * Cranium Size", y = "Weight", color = "Age Range")+
    theme(plot.title = element_text(hjust = 0.5))+xlim(135000,190000)+ylim(2000,4000)

Different age lead to different slope, for example:

The relation between the three physical measurements for a minor mother has the smaller slope, so a newborn’s with bigger cranium or body length can lead to a smaller body weight compared to an older mom, this, probably, is because a young mother has a less stable and prepared body.
A mother in the range of 30+ age has the highest slope, so smaller change can lead to bigger impact on the newborn’s weight, so an older mother is more unpredictable.

ggplot(data = df)+
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = MotAge), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = MotAge), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Number of pregnancy month: Linear Regression by Mother's age",
       x = "# Months", y = "Weight", color = "Age range")+
    theme(plot.title = element_text(hjust = 0.5))

Considering the number of months of the pregnancy the same behavior is highlighted. The most stable and reliable age range is between 18 and 30.

4.3 Smoking habits

 g1 = ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = Fumatrici), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = Fumatrici), se = FALSE, method = "lm")+  
  theme_minimal()+ylim(3100, 4100)+xlim(150000, 190000)+labs(color = "Smoking Habits")+xlab(NULL)+ylab(NULL)

 g2 = ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = Fumatrici), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = Fumatrici), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Newborn's Measurments: Linear Regression by smoking habits",
       x = "Body Length * Cranium Size", y = "Weight", color = "Smoking habits")+
    theme(plot.title = element_text(hjust = -0.3), legend.position = "none", 
          axis.title.x = element_text(hjust = 4,  vjust = -1))
 
 g2 | g1

Instead of gender, smoking habits influence the relation between Cranium size and body length with the newborn’s weight. The slope is higher in the smoker cluster, and taken as normal the non smoker cluster, this means that with smaller changes the output can change faster, so this means more unpredictability.

ggplot(data = df)+
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = Fumatrici), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = Fumatrici), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Number of pregnancy months: Linear Regression by smoking habits",
       x = "# months", y = "Weight", color = "Smoking habits")+
    theme(plot.title = element_text(hjust = 0.5))+xlim(30, max(df$Gestazione))

The image above highlight perfectly the impact of smoking. More month of pregnancy doesn’t lead to an increase body weight. So smoking can be the reason of underweight babies. It’s also clear from the chart, that the datapoint of the smoking cluster is at the edge of the non smoking group, with less point above the pink regression line and more under the line.

5 Conclusion

This project lead to some very important conclusion highlighting the impact of different variable to the prediction of the newborn’s baby. The most important output was understanding the real impact of smoking habits; smoking lead to an unpredictable pregnancy, so Neonatal Health Solutions should invest in preventing smoking during the pregnancy. Another important component is the age of the mother, at a very young age the body of the mother is not ready for a long pregnancy, so the prevention of unwanted pregnancies and a good care of the health of the body of the mother can helps pregnancy at a very young age; meanwhile at an older age (30+) the first months of the pregnancy seem to be the most dangerous, so the prevention and care should start at the very beginning of the pregnancy. Also, from the models, the body measurements of the newborn’s can be a very important indicator of the future weight of the baby, so Neonatal Health Solutions should invest in better equipment to measure body length and cranium size.

Statistical Model for Predicting Neonatal Weight

Stefano Arrigoni