library(modelsummary)
library(patchwork)
library(dplyr) 
library(ggplot2)
library(moments)
library(lmtest)
library(lubridate)
library(knitr)
library(gghalves)
library(RColorBrewer)
library(MASS)
library(car)
library(synthpop)




df = read.csv("neonati.csv", stringsAsFactors = TRUE)

df$Fumatrici <- factor(df$Fumatrici, 
                        levels = c(0, 1),
                        labels = c("Not a Smoker", "Smoker"))

First Visualization

kable(summary(df), caption = "Summary of the dataset")
Summary of the dataset
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. : 0.00 Min. : 0.0000 Not a Smoker:2396 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1256
1st Qu.:25.00 1st Qu.: 0.0000 Smoker : 104 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1772 osp2:849 M:1244
Median :28.00 Median : 1.0000 NA Median :39.00 Median :3300 Median :500.0 Median :340 NA osp3:835 NA
Mean :28.16 Mean : 0.9812 NA Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 NA 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 NA Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

Age of the mother smaller than 12-14 is probably errors in the dataset, so here the cleaned dataset. Even if 12-14 is still a very young age it happens, so it’s going to be keep in the dataset.

df <- df[df$Anni.madre > 12, ]
attach(df)

kable(summary(df), caption = "Summary of the dataset")
Summary of the dataset
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
Min. :13.00 Min. : 0.0000 Not a Smoker:2394 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1255
1st Qu.:25.00 1st Qu.: 0.0000 Smoker : 104 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1770 osp2:848 M:1243
Median :28.00 Median : 1.0000 NA Median :39.00 Median :3300 Median :500.0 Median :340 NA osp3:834 NA
Mean :28.19 Mean : 0.9816 NA Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 NA 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 NA Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

1. Preliminary Analysis

A. Variable:

Explanatory variable:

  • “Anni.Madre”: age of the mother; it’s a discrete quantitative variable.
  • “N.gravidanze”: number of pregnancy of the mother; it’s a discrete quantitative variable.
  • “Fumatrici”: binary variable (dummy variable); 1 = mother is a smoker - 0 = mother is not a smoker.
  • “Gestazione”: number of week of the pregnancy; it’s a discrete quantitative variable.
  • “Lunghezza”: newborn’s length in mm; it’s a continous quantitative variable.
  • “Cranio”: newborn’s head diameter in mm; it’s a continous quantitative variable.
  • “Tipo.parto”: type of delivery, cesarean or natural; it’s a qualitative variable, with two class.
  • “Ospedale”: Identification of the hospital; it’s a qualitative variable with three class.
  • “Sesso”: newborn’s gender; it’s a qualitative variable with two class.

The qualitative variable are converted to factor, so that can be used as class during the analysis.

Target Variable:

  • “Peso”: newborn’s weight in grams;it’s a continous quantitative variable.

B. Distribution and Box Plot:

Quantitative Variable
  • Continous
gini_index_func = function(col){
  n_i = table(col)
  f_i_2 = (table(col)/length(col)) ^ 2
  j = length(table(col))
  gini = 1 - sum(f_i_2)
  gini_norm = gini/((j - 1)/j)
  return(gini_norm)}


density_form_con = function(column, label_name, y_cord, y_cord_2){
  kurt = round(kurtosis({{column}}) - 3, 1)
  fis = round(skewness({{column}}), 1)
  
  string_1 = paste("Kurtosis Index = ", kurt)
  
  string_2 = paste("Fischer Index = ", fis)
  
  plot_title = paste0("Newborn's ", label_name, " Distribution" )
  
  
  density_value = density({{column}})
  
  
  g1 = ggplot(data = df)+
    geom_half_violin(aes(y = {{column}}),side = "left", col = "#5A7FA8", linewidth =
                       0.7)+
    geom_half_boxplot(aes(y = {{column}}), fill = "#5A7FA8")+
    labs(title = plot_title, x = label_name, y = "")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))+
    annotate("label", label = string_1, x = 0.7, y = y_cord, fill = "white", vjust = 1,
             hjust = 1)+
    annotate("label", label = string_2, x = 0.7, y = y_cord_2, 
             fill = "white",
             vjust = 1, hjust = 1)
  
  return(g1)
}
plt_cra = density_form_con(Cranio, "Cranium diameter", 400, 388)

plt_cra

The newborn’s Cranium diameter has an high peak; the shape is capture by the Kurtosis index that is greater than 0. The Fischer index suggest correctly the asymmetry of the newborn’s Cranium diameter; with a negative asymmetry. From the boxplot is also possible to see various outliers.

plt_len = density_form_con(Lunghezza, "Length (mm)", 560, 542)

plt_len

The newborn’s length has a more elongated shape, with an higher peak. This shape it’s also capture by the Kurtosis index that is greater than 0. The same logic of before can be applied for the Fischer index that capture the negative asymmetry of the distribution. Similar to the Cranium size the boxplot has various outlier.

plt_wei = density_form_con(Peso, "weight", 5000, 4720)

plt_wei

The target variable has a kurtosis index equal to 2, that means a more elongated shape and an higher peak in comparison with the normal distribution. Fischer index is slightly negative, so the shape is similar to a normal distribution with a small negative asymmetry.

shap = shapiro.test(Peso)
kable(paste("The target variable does not have a normal distribution, this can be seen from the p-value of the shapiro test, because is smaller than 0.05 and so the null hypothesis is rejected: p-value = ", round(shap$p.value,3)), col.names = "")
The target variable does not have a normal distribution, this can be seen from the p-value of the shapiro test, because is smaller than 0.05 and so the null hypothesis is rejected: p-value = 0
  • Discrete:
density_form_dis = function(column, label_name, y_cord){
  kurt = round(kurtosis({{column}}) - 3, 1)
  fis = round(skewness({{column}}), 1)
  

  string_1 = paste("Kurtosis Index = ", kurt)

  string_2 = paste("Fischer Index = ", fis)

  plot_title = paste0(label_name, " Distribution" )


  g1 = ggplot(data = df)+
    geom_boxplot(aes(y = {{column}}), fill = "#5A7FA8")+
    labs(title = plot_title, y = label_name, x = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))+
    annotate("label", label = string_1, x = 0.7, y = 45, fill = "white", vjust = 1,
             hjust = 1)+
   annotate("label", label = string_2, x = 0.7, y = y_cord, fill = "white", vjust = 1,
           hjust = 1)

  g2 = ggplot(data = df)+
    geom_col(aes(x = {{column}}),stat = "count",col = "black" , fill = "#5A7FA8",
             linewidth = 0.7)+
    labs(title = plot_title, x = label_name, y = "")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))
  return(g1 | g2)
}
plt_ges = density_form_dis(Gestazione, "Month of Pregnancy", 43.5)


plt_ges

The month of pregnancy has the same asymmetry and shape as “Lughezza” and “Cranio”, this behavior is very clear from the two index evaluate. From the boxplot is also possible to see some outliers especially in the lower range of the variable.

plt_age = density_form_dis(Anni.madre, "Mother Age", 41.5)


plt_age

Instead the age distribution has a shape equal to the normal distribution. The bell shape is reflected also in the index that are very close to 0.

ggplot()+
  geom_col(aes(x = N.gravidanze), stat = "count", col = "black", fill = "#5A7FA8")+
  labs(title = "# Pregnancy Distribution", x = "# Pregnancy", y = "Count", caption = "Caption: Visualization of the frequency distribution for the number of pregnancy")+
  theme_classic()+
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.45,vjust = -1,face = "bold",size = 10, colour = "black" ))

Qualitative Variable

Below are plotted the frequency for all the qualitative variable. It’s also evaluated the Gini Index, where an Index near 1 means that the variable is evenly distributed across the classes; instead a Gini index equal to zero is related to a variable that has one dominant class.

viz_qual_func = function(x, col_label, y_cord,x_cord, h_cord, v_cord, size_val){
  
  
  gini_string = paste("Gini Index = ", round(gini_index_func({{x}}) , 1))
  title_string = paste("Frequency Distribution: ", col_label)
  
  g1 = ggplot()+
    geom_col(aes( x = {{x}}), stat = "count", fill = "#5A7FA8", col = "black")+
    labs( title = title_string, x = col_label, y = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))+
     annotate("label", label = gini_string, 
              x = x_cord, y = y_cord,
              vjust = v_cord,
              hjust = h_cord, 
              alpha = 0.3, size = size_val)
  return(g1)
}


plot_fum = viz_qual_func(Fumatrici, "Smoker", Inf, Inf, 1.1, 1, 4)
plot_fum

plot_del = viz_qual_func(Tipo.parto, "Cesarean or Natural", Inf, -Inf, -0.1, 1, 4)
plot_del

plot_ho = viz_qual_func(Ospedale, "Hospital", Inf, Inf, 1, 1, 3.2)
plot_gen = viz_qual_func(Sesso, "Male or Female", Inf, Inf, 1.1, 1, 3.2)

plot_ho | plot_gen

The dataset is evenly distributed between the Hospitals and the newborn’s gender; also the mode of delivery is evenly distributed with a Gini index of 0.8 and a slight predominance of the Natural delivery. The First figure tells us that the mother are almost everytime non smoker in this dataset.

C. Hypothesis test:

  • H_{0}: In some hospitals, more cesarean sections are performed

Let’s first visualize the frequency table observed between the three hospitals.

ggplot()+
  geom_col(aes(x = Tipo.parto, fill = Ospedale), stat = "count", position = "dodge")+
  labs(title = "Mode of delivery by hospital", x = "Mode of delivery", y = "Count")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))

To reject or not the null hypothesis a chi-squared test can be perform.

chi = chisq.test(table(Ospedale, Tipo.parto))

kable(paste("Chi-Squared test p-value = ", round(chi$p.value, 2) , ". So the null hypothesis is rejected and from the dataset the number of cesarean section is indipendent from the hospital"), col.names = "")
Chi-Squared test p-value = 0.58 . So the null hypothesis is rejected and from the dataset the number of cesarean section is indipendent from the hospital
  • The average weight and length of this dataset is statistical representative of the population

WEIGHT:

The known average weight of the newborn’s is 3300g.

test_data =t.test(Peso, mu = 3300, conf.level = 0.95)


kable(paste("Taken the population mean for the newborn's weight as 3300g a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to ", round(test_data$p.value, 3), ". So the null hypothesis is rejected and the mean is not equal to the population mean."), col.names = "")
Taken the population mean for the newborn’s weight as 3300g a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to 0.132 . So the null hypothesis is rejected and the mean is not equal to the population mean.
kable(paste("However the population mean is within the confidence level of [", round(test_data$conf.int[1],2),round(test_data$conf.int[2], 2), "]. So the difference can be attributed to the data selection."), col.names = "")
However the population mean is within the confidence level of [ 3263.58 3304.79 ]. So the difference can be attributed to the data selection.
kable("Considering the confindence level analysis the newborn's weight of the dataset is significantly equal to the population mean.", col.names = "")
Considering the confindence level analysis the newborn’s weight of the dataset is significantly equal to the population mean.

LENGHT:

The known average weight of the newborn’s is 500mm.

test_data =t.test(Lunghezza, mu = 500, conf.level = 0.95)


kable(paste("Taken the population mean for the newborn's length as 500mm a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to ", round(test_data$p.value, 3), ". So the null hypothesis is rejected and the mean is not equal to the population mean."), col.names = "")
Taken the population mean for the newborn’s length as 500mm a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to 0 . So the null hypothesis is rejected and the mean is not equal to the population mean.
kable(paste("The population mean is also outside the confidence interval of [", round(test_data$conf.int[1],2),round(test_data$conf.int[2], 2), "]. So the difference can not be attributed to the data selection."), col.names = "")
The population mean is also outside the confidence interval of [ 493.66 495.73 ]. So the difference can not be attributed to the data selection.
kable("The newborn's length of the dataset is significantly different to the population mean.", col.names = "")
The newborn’s length of the dataset is significantly different to the population mean.
  • The average of weight,length and cranium diameter statistically different between the gender

To answer a t-test is performed, with a confidence level of 95% and considering both side of the distribution. The t-test is performed considering the various measurements in relation with the gender.

test_data = t.test(data = df, Peso ~ Sesso)

ggplot()+
  geom_boxplot(aes(y = Peso, x = Sesso, fill = Sesso))+
  labs(title = "Weight Distribution by gender", x = "Gender", y = "Weight")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)), 
              x = 2.3, y = 1000,
              alpha = 0.3)+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)), 
              x = 0.7, y = 1000,
              alpha = 0.3)

kable(paste("p-value = ", round(test_data$p.value,3), ". So the null hypotesis is rejected and the weight mean between the two gender is statistically different."), col.names = "")
p-value = 0 . So the null hypotesis is rejected and the weight mean between the two gender is statistically different.
test_data = t.test(data = df, Lunghezza ~ Sesso)

ggplot()+
  geom_boxplot(aes(y = Lunghezza, x = Sesso, fill = Sesso))+
  labs(title = "Length Distribution by gender", x = "Gender", y = "Weight")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)), 
              x = 2.3, y = 300,
              alpha = 0.3)+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)), 
              x = 0.7, y = 300,
              alpha = 0.3)

kable(paste("p-value = ", test_data$p.value, ". So the null hypotesis is rejected and the length mean between the two gender is statistically different."), col.names = "")
p-value = 2.23232824860459e-21 . So the null hypotesis is rejected and the length mean between the two gender is statistically different.
test_data = t.test(data = df, Cranio ~ Sesso)

ggplot()+
  geom_boxplot(aes(y = Cranio, x = Sesso, fill = Sesso))+
  labs(title = "Cranium size Distribution by gender", x = "Gender", y = "Weight")+
  theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)), 
              x = 2.3, y = 240,
              alpha = 0.3)+
  annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)), 
              x = 0.7, y = 240,
              alpha = 0.3)

kable(paste("p-value = ", test_data$p.value, ". So the null hypotesis is rejected and the Cranium diameter mean between the two gender is statistically different."), col.names = "")
p-value = 1.41401856590502e-13 . So the null hypotesis is rejected and the Cranium diameter mean between the two gender is statistically different.

2. Creation of the Regression Model

model_1 = lm(Peso ~., data = df)
summary(model_1)
## 
## Call:
## lm(formula = Peso ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1123.26  -181.53   -14.45   161.05  2611.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6735.7960   141.4790 -47.610  < 2e-16 ***
## Anni.madre          0.8018     1.1467   0.699   0.4845    
## N.gravidanze       11.3812     4.6686   2.438   0.0148 *  
## FumatriciSmoker   -30.2741    27.5492  -1.099   0.2719    
## Gestazione         32.5773     3.8208   8.526  < 2e-16 ***
## Lunghezza          10.2922     0.3009  34.207  < 2e-16 ***
## Cranio             10.4722     0.4263  24.567  < 2e-16 ***
## Tipo.partoNat      29.6335    12.0905   2.451   0.0143 *  
## Ospedaleosp2      -11.0912    13.4471  -0.825   0.4096    
## Ospedaleosp3       28.2495    13.5054   2.092   0.0366 *  
## SessoM             77.5723    11.1865   6.934 5.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 668.7 on 10 and 2487 DF,  p-value: < 2.2e-16

Non-Categorical Variable:

Categorical Variable:

For categorical variable the summary needs to be read slightly different. The coefficient means that taken as constant the other variable a change of class define a change of the output variable equal to the coefficient.

2 Searching the best Regression Model

The starting point of the saturated model (model_1) is:

2.1 Stepwiese method

Correllation Matrix

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(df, upper.panel = panel.smooth, lower.panel = panel.cor)

Above there is a plot of the correlation Matrix where is possible to see the correlation between the quantitative variable.

As expected the three physical measurement are highly correlated, with length and cranium size that are related with a coefficient of 0.60 and are respectively correlated with the weight with a coefficient of 0.80 and 0.70. The correlation matrix confirm the relation between “Gestazione” and the physical measurements, with coefficient 0.59 with weight, 0.62 with length, 0.46 with Cranium size; this aligns with what it’s known about pregnancy. From the chart it’s also possible to see that the relation between the months of the pregnancy and the physical measurements starts with a linear behavior and has a sort of plateau reached a certain body size.

The high correlation between cranium size and length suggest that it can be interaction within the structure of the model, as well as length and “Gestazione”.

Searching the right model

modelcomparison = function(model_A, model_B){
  dataanova = anova(model_A, model_B)
  pvalue = dataanova$`Pr(>F)`[2]
  
  aic_a = AIC(model_A)
  bic_a = BIC(model_A)

  aic_b = AIC(model_B)
  bic_b = BIC(model_B)
  
  summ_a = summary(model_A)
  summ_b = summary(model_B)
  r_a = round(summ_a$r.squared, 3)
  r_b = round(summ_b$r.squared, 3)
  
  mse_model_a = mean(model_A$residuals^2)
  rmse_model_a = sqrt(mse_model_a)
  
  mse_model_b = mean(model_B$residuals^2)
  rmse_model_b = sqrt(mse_model_b)
  
  model <- c("Starting Model", "New Model")
  AIC <- c(aic_a, aic_b)
  BIC <- c(bic_a, bic_b)
  R <- c(r_a, r_b) 
  RMSE_value <- c(rmse_model_a, rmse_model_b)
  anovatest <- c("", round(pvalue, 3))
  
  tablecomparison <- data.frame(Models = model, AIC_value = AIC, BIC_value = BIC, R_squared = R, RMSE =  RMSE_value, ANOVA_pvalue = anovatest)
  
  return(tablecomparison)
}
  • I. There should be no difference between different hospital.
model_2 = update(model_1, ~. - Ospedale)
table12 = modelcomparison(model_1, model_2)
kable(table12)
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Starting Model 35145.57 35215.45 0.729 273.4174
New Model 35150.75 35208.98 0.728 273.9203 0.01

The anova test suggest that without the hospital variable the model are statistically different to the first model, with a p-value around 0.01 that is lower than 0.05. From the table above the BIC coefficient for model one (saturated model) is slightly lower to the new model’s coefficient, on the other side the AIC coefficient tells the opposite story. This suggests, with the R\(^{2}\) value that is the same between the two model, that the two model explain the same amount of information about the newborn’s weight, but the AIC parameter prefer over-parametrize model and BIC prefer simpler model. In this case simpler model is better because can lead to identify better what really influence the weight considering that there should be no difference between hospitals.

    1. Years of the mother have a low impact on the weight:
model_3 = update(model_2, ~. - Anni.madre)
table23 = modelcomparison(model_2, model_3)
kable(table23)
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Starting Model 35150.75 35208.98 0.728 273.9203
New Model 35149.33 35201.73 0.728 273.9518 0.45

The test confirm the behavior suggested by the summary of the saturated model. - pvalue of the anova test above the threshold of 0.05, so the two model are statistically equal. - BIC and AIC coefficient are very close, so the information explained by the two model remain equal, with a slight improvement in the coefficients of the new model especially in the BIC coefficient.

    1. Studying the interaction between cranium size and length:
model_4 = update(model_3, ~. - Cranio)
table34 = modelcomparison(model_3, model_4)
kable(table34)
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Starting Model 35149.33 35201.73 0.728 273.9518
New Model 35692.26 35738.84 0.661 305.5233 0

Without the Cranium size variable R\(^{2}\) goes from 0.73 to 0.66, so the percentage of information explained by the new model decrease significantly; this is confirmed by the pvalue of the anova test that is below the threshold. BIC and AIC coeffienct confirm that the starting model is better.

model_5 = update(model_3, ~. -Cranio - Lunghezza + Cranio * Lunghezza)
table35 = modelcomparison(model_3, model_5)
kable(table35)
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Starting Model 35149.33 35201.73 0.728 273.9518
New Model 35128.71 35186.95 0.730 272.7146 0

Introducing an interaction factor between Cranium size and body length the anova test tells that the two model are statistically different and the BIC/AIC coefficient suggest that the model with the interaction factor explain better the newborn’s weight, with a lower BIC coefficient. In this case also the AIC coefficient is better for the new model. Meanwhile the percentage of the phenomenon explained by the model remain stable.

    1. Non linearity:

Is very clear that the physical measurements is clearly one of the best way to predict the newborn’s weight. So it can be studied if the non linearity coefficient of the length and Cranio can make the model improve,

final_model = update(model_5, ~.  + I(Lunghezza**2) + I(Cranio**2))
table5f = modelcomparison(model_5, final_model)
kable(table5f)
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Starting Model 35128.71 35186.95 0.73 272.7146
New Model 35042.83 35112.71 0.74 267.8520 0

The non linear coefficient of the newborn’s length improve the model with a general imporvement of R\(^{2}\), AIC, BIC coefficients and RMSE.

3 Model Quality Analysis

WINNING MODEL

kable(final_model$coefficients, col.names = "Coefficient values")
Coefficient values
(Intercept) -770.1966995
N.gravidanze 14.5143328
FumatriciSmoker -23.5146476
Gestazione 41.1071786
Tipo.partoNat 28.4210553
SessoM 71.8929637
Cranio 4.2662118
Lunghezza -11.7036094
I(Lunghezza^2) 0.0438045
I(Cranio^2) 0.0524431
Cranio:Lunghezza -0.0599864

Multicollinearity

vif_table = vif(final_model, type = "predictor")

kable(vif_table, caption = "Multicollinearity coefficients")
Multicollinearity coefficients
GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
N.gravidanze 1.028617 1 1.014208 Fumatrici, Gestazione, Tipo.parto, Sesso, Cranio, Lunghezza
Fumatrici 1.007964 1 1.003974 N.gravidanze, Gestazione, Tipo.parto, Sesso, Cranio, Lunghezza
Gestazione 1.853892 1 1.361577 N.gravidanze, Fumatrici, Tipo.parto, Sesso, Cranio, Lunghezza
Tipo.parto 1.004512 1 1.002253 N.gravidanze, Fumatrici, Gestazione, Sesso, Cranio, Lunghezza
Sesso 1.048700 1 1.024061 N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Cranio, Lunghezza
Cranio 1.917242 5 1.067254 I(Cranio^2), Lunghezza N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Sesso
Lunghezza 1.917242 5 1.067254 I(Lunghezza^2), Cranio N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Sesso

The multicollinearity coefficients are under 5, so the model assumption on multicollinearity is respected.

General Overview

plot(final_model)

A. Residuals vs Fitted

The residuals plot is consistent with a random distribution around zero

B. Q-Q Plot

The points should be around the bisector, and the results have some problems at the extreme with the model that is not consistent at the extreme part, probably due to some outliers

C. Scale-Location

The points are distributied randomly around a constant between 1 and 0.5.

D. Residuals vs. Leverage

Here is possible to see that 3 values are outliers. Where 310 and 1429 are close to the 0.5 limit, so they shouldn’t be a problem for the model; but 1551 is outside of the threshold of 1, so can have an influence on the model.

Here the outliers:

kable(df[1549, ])
Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
1551 35 1 Not a Smoker 38 4370 315 374 Nat osp3 F

It seems a normal datapoint, where a non smoker mother was able to do a Natural delivery at the end of the pregnancy itself. The weight is above way bigger than the female average of 3161 and this can be the reason. Some pregnancy that lead to a newborn’s with an high weight exists and can also be due to different reason as well as diseases, so it’s correct to have it in the dataset even if can influence the model.

Residuals

ggplot()+
  geom_density(aes(residuals(final_model)))+
  labs(title = "Final Model: Density plot of the Residuals", y = "Density", x = "Residuals")+
  theme_minimal()+theme(plot.title = element_text(hjust = 0.5))

shap = shapiro.test(residuals(final_model))
bp = bptest(final_model)
dw = dwtest(final_model)

Tests <- c("Shapiro Test", "Breusch-Pagan Test", "Durbin-Watson Test")
Pvalue <- c(round(shap$p.value, 3), round(bp$p.value, 3), round(dw$p.value, 3))

recap_dataframe <- data.frame(Test = Tests, pvalue = Pvalue)
kable(recap_dataframe)
Test pvalue
Shapiro Test 0.000
Breusch-Pagan Test 0.000
Durbin-Watson Test 0.085

From the result above it’s easy to understand that there are some problems with the model.

Outliers

A brief study of the outliers can explain what is happening.

lev = hatvalues(final_model)
p = sum(lev)
n = nrow(df)
Threshold = 2* p/n

plot(lev, main = "Leverage values", ylab = "Leverage values", xlab = "Index", pch = 1) 
abline(h = Threshold, col = "red", lwd = 1, lty = 2)
text(x = 2200,  
     y = 0.58,  
     labels = paste("Number of outliers =", length(lev[lev > Threshold])),
     col = "black", font = 2, cex = 0.8)
legend(
  "topleft",                    
  legend = c("Leverage", "Threshold"),
  col = c("black", "red"),          
  pch = c(1, NA),                
  lty = c(NA, 2),                  
  lwd = c(NA, 1),                  
  bty = "n")

From the leverage plot, can already be seen that there are more than 200 outliers.

plot(rstudent(final_model), main = "Residuals rinormalized", ylab = "Studentize residuals")
abline(h=c(-2,2), col = "red")
legend(
  "topleft",                    
  legend = c("Residuals", "Threshold"),
  col = c("black", "red"),          
  pch = c(1, NA),                
  lwd = c(NA, 1),                  
  bty = "n")

Also through the tstudent ri-normalization it clear that the residuals have outliers. So the outliers can be studied further.

detach(df)
outliers = which(lev < Threshold)

df["Leverage value"] = lev


ggplot(data = df)+
  geom_point(aes(x = Fumatrici, y = `Leverage value`))+
  labs( title = "Leverage values by Smoker habits", x = "Smoker", y = "Leverage Value")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))

Smoking create almost every time an “out of standard” residuals of the newborn’s weight, enough to influence the regression model to be unpredictable.

df_outliers = df[-outliers, ]

ggplot(data = df_outliers)+
  geom_col(aes(x = Fumatrici), stat = "Count", fill = "5A7FA8", col = "black")+
  labs( title = "Count of data with Leverage value above threshold", x = "Smoking", y = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))

notoutliers = which(lev > Threshold)
df_clean = df[-notoutliers,]


ggplot(data = df_clean)+
  geom_col(aes(x = Fumatrici), stat = "Count", fill = "5A7FA8", col = "black")+
  labs( title = "Count of data with Leverage value below threshold", x = "Smoking", y = "Count")+
    theme_classic()+
    theme(plot.title = element_text(hjust = 0.5))

This last two column plots confirm that a smoker is always out of standard, for this dataset at least, influencing the regression model.

Let’s see if a model with only non smoker datapoints reflect better the requirment for a regression model.

df_non_smoker <- df[df$Fumatrici == "Not a Smoker", ]
df_non_smoker$Fumatrici = NULL
saturated_model_nonsmok = lm(Peso ~. - `Leverage value`, data = df_non_smoker)
summary(saturated_model_nonsmok)
## 
## Call:
## lm(formula = Peso ~ . - `Leverage value`, data = df_non_smoker)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1113.79  -181.67   -14.27   162.97  2577.80 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6736.1069   143.3650 -46.986  < 2e-16 ***
## Anni.madre        0.7243     1.1739   0.617  0.53728    
## N.gravidanze     12.5362     4.7702   2.628  0.00864 ** 
## Gestazione       34.5836     3.8885   8.894  < 2e-16 ***
## Lunghezza        10.1097     0.3062  33.018  < 2e-16 ***
## Cranio           10.4955     0.4340  24.183  < 2e-16 ***
## Tipo.partoNat    33.1134    12.3394   2.684  0.00733 ** 
## Ospedaleosp2    -10.0486    13.7665  -0.730  0.46551    
## Ospedaleosp3     33.2505    13.8054   2.409  0.01609 *  
## SessoM           79.7595    11.4224   6.983 3.74e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.3 on 2384 degrees of freedom
## Multiple R-squared:  0.7302, Adjusted R-squared:  0.7292 
## F-statistic:   717 on 9 and 2384 DF,  p-value: < 2.2e-16

The r-squared is similar to the previous one, with a value of 0.73.

To choose the final model for non smokers is taken in consideration:

attach(df_non_smoker)

modelcomparison_2 = function(model_A, model_B){
  dataanova = anova(model_A, model_B)
  pvalue = dataanova$`Pr(>F)`[2]
  
  aic_a = AIC(model_A)
  bic_a = BIC(model_A)

  aic_b = AIC(model_B)
  bic_b = BIC(model_B)
  
  summ_a = summary(model_A)
  summ_b = summary(model_B)
  r_a = round(summ_a$r.squared, 3)
  r_b = round(summ_b$r.squared, 3)
  
  mse_model_a = mean(model_A$residuals^2)
  rmse_model_a = sqrt(mse_model_a)
  
  mse_model_b = mean(model_B$residuals^2)
  rmse_model_b = sqrt(mse_model_b)
  
  model <- c("Saturated Model", "Final Model")
  AIC <- c(aic_a, aic_b)
  BIC <- c(bic_a, bic_b)
  R <- c(r_a, r_b) 
  RMSE_value <- c(rmse_model_a, rmse_model_b)
  anovatest <- c("", round(pvalue, 3))
  
  tablecomparison <- data.frame(Models = model, AIC_value = AIC, BIC_value = BIC, R_squared = R, RMSE =  RMSE_value, ANOVA_pvalue = anovatest)
  
  return(tablecomparison)
}




final_model_nosmok = update(saturated_model_nonsmok, ~. - Tipo.parto - Ospedale - Lunghezza - Cranio + Tipo.parto * Anni.madre + I(Lunghezza**2) + I(Cranio**2) )

kable(modelcomparison_2(saturated_model_nonsmok, final_model_nosmok))
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Saturated Model 33687.07 33750.66 0.730 273.7448
Final Model 33637.67 33695.48 0.736 271.0479 NA

The final model for the non smoker has a better BIC and AIC parameter and a slightly better R squared around 0.74.

With the same logic it’s possible to create a model for the mothers that smoke.

detach(df_non_smoker)

df_smoker <- df[df$Fumatrici == "Smoker", ]
df_smoker$Fumatrici = NULL
attach(df_smoker)

saturated_model_smok = lm(Peso ~. - `Leverage value`, data = df_smoker)

final_model_smok = update(saturated_model_smok, ~.   - Tipo.parto - Ospedale - Lunghezza - Cranio + Tipo.parto * Anni.madre + I(Lunghezza**2) + I(Cranio**2) )

detach(df_smoker)
kable(modelcomparison_2(saturated_model_smok, final_model_smok))
Models AIC_value BIC_value R_squared RMSE ANOVA_pvalue
Saturated Model 1453.224 1482.312 0.756 235.5555
Final Model 1449.915 1476.359 0.759 234.0780 NA
shap_nonsmok = shapiro.test(residuals(final_model_nosmok))
bp_nonsmok = bptest(final_model_nosmok)
dw_nonsmok = dwtest(final_model_nosmok)

shap_smok = shapiro.test(residuals(final_model_smok))
bp_smok = bptest(final_model_smok)
dw_smok = dwtest(final_model_smok)


model <- c("Not smoker model", "Smoker Model")
Shapiro_Test <- c(round(shap_nonsmok$p.value, 3), round(shap_smok$p.value, 3))
BP_Test <- c(round(bp_nonsmok$p.value, 3), round(bp_smok$p.value, 3))
DW_Test <- c(round(dw_nonsmok$p.value, 3), round(dw_smok$p.value, 3)) 
  
tabletests <- data.frame(Models = model, Shap_test = Shapiro_Test, BP_test = BP_Test, DW_test = DW_Test)
kable(tabletests)
Models Shap_test BP_test DW_test
Not smoker model 0.000 0.000 0.037
Smoker Model 0.361 0.062 0.788

Not Smoking Model:

  1. Shapiro test has a p-value below threshold, so the null hypothesis is rejected and the residuals does not follow a normal distribution
  2. Breusch-Pagan Test has a p-value below threshold, so the null hypothesis is rejected and indicates heteroscedasticity
  3. Durbin-Watson Test has a p-value slightly below the threshold of 0.05 and indicates that true autocorrelation is greater than 0
plot(final_model_nosmok)

Smoking Model:

  1. Shapiro test has a p-value above threshold, so the null hypothesis is not rejected and the residuals does follow a normal distribution
  2. Breusch-Pagan Test has a p-value above threshold, so the null hypothesis is not rejected and indicates homoscedasticity (residuals have a constant variance)
  3. Durbin-Watson Test has a p-value slightly above the threshold of 0.05 and indicates that there isn’t autocorrelation
plot(final_model_smok)

3 Prediction Process

  • General Model
col_to_keep = c("N.gravidanze", "Fumatrici", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")

new_data = df[, col_to_keep]
  
prevision = predict(final_model, newdata = new_data)
  
new_data["Peso"] = prevision

new_data["Category"] = "Prediction"

df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

With the training data the model seems stable and predict similar value to the actual model.

Now to test if the model is stable with a different dataset we can use the bootstrapping method. A new dataframe is create with samples of the row of the initial dataset so that the proportion and the underlying natural laws are conserved, but it’s still possible to see if the model can handle new dataset.

df_train = df[,col_to_keep]
new_data <- df[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]

prediction <- predict(final_model, newdata = new_data)
new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model is still stable even with the new dataset generated from the training data. With the same logic is possible to generate a new synthetic dataset that reflect the underlying natural laws, so that there isn’t possibility of impossible combination of, for example, Cranium size, lenght and newborn’s weight.

new_data <- mysynt$syn

prediction <- predict(final_model, newdata = new_data)

new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model remain stable and accurate with the synthetic dataset too.

  • Non Smoker Model
col_to_keep = c("N.gravidanze", "Anni.madre", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")

new_data = df_non_smoker[, col_to_keep]
  
prevision = predict(final_model_nosmok, newdata = new_data)
  
new_data["Peso"] = prevision

new_data["Category"] = "Prediction"

df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model seems stable and predict similar value to the actual dataset

Bootstrapping method.

df_train = df_non_smoker[,col_to_keep]
new_data <- df_non_smoker[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]

prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model is still stable even with the new dataset generated from the training data. As before now it’s possible to study the model with a synthetic dataset.

new_data <- mysynt$syn

prediction <- predict(final_model_nosmok, newdata = new_data)

new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

The model remain stable and accurate with the synthetic dataset too.

  • Smoker Model
col_to_keep = c("N.gravidanze", "Anni.madre", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")

new_data = df_smoker[, col_to_keep]
  
prevision = predict(final_model_smok, newdata = new_data)
  
new_data["Peso"] = prevision

new_data["Category"] = "Prediction"

df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

df_train = df_smoker[,col_to_keep]
new_data <- df_smoker[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]

prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

new_data <- mysynt$syn

prediction <- predict(final_model_smok, newdata = new_data)

new_data["Peso"] = prediction

new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]

df_totale <- rbind(df_bind, new_data)

ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
  geom_density(alpha = 0.4) +
  theme_minimal()+
  labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
       x = "Weight", y = "Density")+
    theme(plot.title = element_text(hjust = 0.5))

Similar to the previous model, the Smoker model, is stable with the three different prediction simulation.

4 Variable importance

Now it’s possible to understand how the different variable impact the linear regression model

4.1 Gender

ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = Sesso), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = Sesso), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Newborn's Measurments: Linear Regression by Gender",
       x = "Body Length * Cranium Size", y = "Weight", color = "Gender")+
    theme(plot.title = element_text(hjust = 0.5))

The gender variable determine two different regression where the male ones have higher measurments, but with same slope, so same relation between the three variable, as expected.

ggplot(data = df)+
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = Sesso), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = Sesso), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Number of pregnancy month: Linear Regression by Gender",
       x = "# Months", y = "Weight", color = "Gender")+
    theme(plot.title = element_text(hjust = 0.5))

The same behavior can be seen with the duration of the pregnancy.

Also for both charts is it possible to see that the male cluster is above the female one in general, as expected.

4.3 Mother’s Age

Another important factor for a good pregnancy is often the mother’s age.

df$MotAge <- cut(
  df$Anni.madre,
  breaks = c(0, 18, 30, max(df$Anni.madre)),
  labels = c("Minors", "18-30", "30+"),
  right = TRUE,  
  include.lowest = TRUE
)


ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = MotAge), size = 0.8, position = "jitter", alpha = 1/5)+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = MotAge), se = FALSE, method = "lm", linewidth = 0.5)+  
  theme_minimal()+
  labs(title = "Newborn's Measurments: Linear Regression by Mother's age",
       x = "Body Length * Cranium Size", y = "Weight", color = "Age Range")+
    theme(plot.title = element_text(hjust = 0.5))+xlim(135000,190000)+ylim(2000,4000)

Different age lead to different slope, for example:

  • The relation between the three physical measurements for a minor mother has the smaller slope, so a newborn’s with bigger cranium or body length can lead to a smaller body weight compared to an older mom, this, probably, is because a young mother has a less stable and prepared body.
  • A mother in the range of 30+ age has the highest slope, so smaller change can lead to bigger impact on the newborn’s weight, so an older mother is more unpredictable.
ggplot(data = df)+
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = MotAge), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = MotAge), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Number of pregnancy month: Linear Regression by Mother's age",
       x = "# Months", y = "Weight", color = "Age range")+
    theme(plot.title = element_text(hjust = 0.5))

Considering the number of months of the pregnancy the same behavior is highlighted. The most stable and reliable age range is between 18 and 30.

4.3 Smoking habits

 g1 = ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = Fumatrici), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = Fumatrici), se = FALSE, method = "lm")+  
  theme_minimal()+ylim(3100, 4100)+xlim(150000, 190000)+labs(color = "Smoking Habits")+xlab(NULL)+ylab(NULL)

 g2 = ggplot(data = df)+
  geom_point(aes(x = Lunghezza * Cranio,
                 y = Peso,
                 col = Fumatrici), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Lunghezza * Cranio,
                  y = Peso,
                  col = Fumatrici), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Newborn's Measurments: Linear Regression by smoking habits",
       x = "Body Length * Cranium Size", y = "Weight", color = "Smoking habits")+
    theme(plot.title = element_text(hjust = -0.3), legend.position = "none", 
          axis.title.x = element_text(hjust = 4,  vjust = -1))
 
 g2 | g1

Instead of gender, smoking habits influence the relation between Cranium size and body length with the newborn’s weight. The slope is higher in the smoker cluster, and taken as normal the non smoker cluster, this means that with smaller changes the output can change faster, so this means more unpredictability.

ggplot(data = df)+
  geom_point(aes(x = Gestazione,
                 y = Peso,
                 col = Fumatrici), size = 0.8, position = "jitter")+
  geom_smooth(aes(x = Gestazione,
                  y = Peso,
                  col = Fumatrici), se = FALSE, method = "lm")+  
  theme_minimal()+
  labs(title = "Number of pregnancy months: Linear Regression by smoking habits",
       x = "# months", y = "Weight", color = "Smoking habits")+
    theme(plot.title = element_text(hjust = 0.5))+xlim(30, max(df$Gestazione))

The image above highlight perfectly the impact of smoking. More month of pregnancy doesn’t lead to an increase body weight. So smoking can be the reason of underweight babies. It’s also clear from the chart, that the datapoint of the smoking cluster is at the edge of the non smoking group, with less point above the pink regression line and more under the line.

5 Conclusion

This project lead to some very important conclusion highlighting the impact of different variable to the prediction of the newborn’s baby. The most important output was understanding the real impact of smoking habits; smoking lead to an unpredictable pregnancy, so Neonatal Health Solutions should invest in preventing smoking during the pregnancy. Another important component is the age of the mother, at a very young age the body of the mother is not ready for a long pregnancy, so the prevention of unwanted pregnancies and a good care of the health of the body of the mother can helps pregnancy at a very young age; meanwhile at an older age (30+) the first months of the pregnancy seem to be the most dangerous, so the prevention and care should start at the very beginning of the pregnancy. Also, from the models, the body measurements of the newborn’s can be a very important indicator of the future weight of the baby, so Neonatal Health Solutions should invest in better equipment to measure body length and cranium size.