library(modelsummary)
library(patchwork)
library(dplyr)
library(ggplot2)
library(moments)
library(lmtest)
library(lubridate)
library(knitr)
library(gghalves)
library(RColorBrewer)
library(MASS)
library(car)
library(synthpop)
df = read.csv("neonati.csv", stringsAsFactors = TRUE)
df$Fumatrici <- factor(df$Fumatrici,
levels = c(0, 1),
labels = c("Not a Smoker", "Smoker"))
kable(summary(df), caption = "Summary of the dataset")
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.0000 | Not a Smoker:2396 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Ces: 728 | osp1:816 | F:1256 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | Smoker : 104 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Nat:1772 | osp2:849 | M:1244 | |
| Median :28.00 | Median : 1.0000 | NA | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | NA | osp3:835 | NA | |
| Mean :28.16 | Mean : 0.9812 | NA | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | NA | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | NA | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
Age of the mother smaller than 12-14 is probably errors in the dataset, so here the cleaned dataset. Even if 12-14 is still a very young age it happens, so it’s going to be keep in the dataset.
df <- df[df$Anni.madre > 12, ]
attach(df)
kable(summary(df), caption = "Summary of the dataset")
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :13.00 | Min. : 0.0000 | Not a Smoker:2394 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Ces: 728 | osp1:816 | F:1255 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | Smoker : 104 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Nat:1770 | osp2:848 | M:1243 | |
| Median :28.00 | Median : 1.0000 | NA | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | NA | osp3:834 | NA | |
| Mean :28.19 | Mean : 0.9816 | NA | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | NA | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | NA | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
Explanatory variable:
The qualitative variable are converted to factor, so that can be used as class during the analysis.
Target Variable:
gini_index_func = function(col){
n_i = table(col)
f_i_2 = (table(col)/length(col)) ^ 2
j = length(table(col))
gini = 1 - sum(f_i_2)
gini_norm = gini/((j - 1)/j)
return(gini_norm)}
density_form_con = function(column, label_name, y_cord, y_cord_2){
kurt = round(kurtosis({{column}}) - 3, 1)
fis = round(skewness({{column}}), 1)
string_1 = paste("Kurtosis Index = ", kurt)
string_2 = paste("Fischer Index = ", fis)
plot_title = paste0("Newborn's ", label_name, " Distribution" )
density_value = density({{column}})
g1 = ggplot(data = df)+
geom_half_violin(aes(y = {{column}}),side = "left", col = "#5A7FA8", linewidth =
0.7)+
geom_half_boxplot(aes(y = {{column}}), fill = "#5A7FA8")+
labs(title = plot_title, x = label_name, y = "")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))+
annotate("label", label = string_1, x = 0.7, y = y_cord, fill = "white", vjust = 1,
hjust = 1)+
annotate("label", label = string_2, x = 0.7, y = y_cord_2,
fill = "white",
vjust = 1, hjust = 1)
return(g1)
}
plt_cra = density_form_con(Cranio, "Cranium diameter", 400, 388)
plt_cra
The newborn’s Cranium diameter has an high peak; the shape is capture by the Kurtosis index that is greater than 0. The Fischer index suggest correctly the asymmetry of the newborn’s Cranium diameter; with a negative asymmetry. From the boxplot is also possible to see various outliers.
plt_len = density_form_con(Lunghezza, "Length (mm)", 560, 542)
plt_len
The newborn’s length has a more elongated shape, with an higher peak. This shape it’s also capture by the Kurtosis index that is greater than 0. The same logic of before can be applied for the Fischer index that capture the negative asymmetry of the distribution. Similar to the Cranium size the boxplot has various outlier.
plt_wei = density_form_con(Peso, "weight", 5000, 4720)
plt_wei
The target variable has a kurtosis index equal to 2, that means a more elongated shape and an higher peak in comparison with the normal distribution. Fischer index is slightly negative, so the shape is similar to a normal distribution with a small negative asymmetry.
shap = shapiro.test(Peso)
kable(paste("The target variable does not have a normal distribution, this can be seen from the p-value of the shapiro test, because is smaller than 0.05 and so the null hypothesis is rejected: p-value = ", round(shap$p.value,3)), col.names = "")
| The target variable does not have a normal distribution, this can be seen from the p-value of the shapiro test, because is smaller than 0.05 and so the null hypothesis is rejected: p-value = 0 |
density_form_dis = function(column, label_name, y_cord){
kurt = round(kurtosis({{column}}) - 3, 1)
fis = round(skewness({{column}}), 1)
string_1 = paste("Kurtosis Index = ", kurt)
string_2 = paste("Fischer Index = ", fis)
plot_title = paste0(label_name, " Distribution" )
g1 = ggplot(data = df)+
geom_boxplot(aes(y = {{column}}), fill = "#5A7FA8")+
labs(title = plot_title, y = label_name, x = "Count")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))+
annotate("label", label = string_1, x = 0.7, y = 45, fill = "white", vjust = 1,
hjust = 1)+
annotate("label", label = string_2, x = 0.7, y = y_cord, fill = "white", vjust = 1,
hjust = 1)
g2 = ggplot(data = df)+
geom_col(aes(x = {{column}}),stat = "count",col = "black" , fill = "#5A7FA8",
linewidth = 0.7)+
labs(title = plot_title, x = label_name, y = "")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
return(g1 | g2)
}
plt_ges = density_form_dis(Gestazione, "Month of Pregnancy", 43.5)
plt_ges
The month of pregnancy has the same asymmetry and shape as “Lughezza” and “Cranio”, this behavior is very clear from the two index evaluate. From the boxplot is also possible to see some outliers especially in the lower range of the variable.
plt_age = density_form_dis(Anni.madre, "Mother Age", 41.5)
plt_age
Instead the age distribution has a shape equal to the normal distribution. The bell shape is reflected also in the index that are very close to 0.
ggplot()+
geom_col(aes(x = N.gravidanze), stat = "count", col = "black", fill = "#5A7FA8")+
labs(title = "# Pregnancy Distribution", x = "# Pregnancy", y = "Count", caption = "Caption: Visualization of the frequency distribution for the number of pregnancy")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.45,vjust = -1,face = "bold",size = 10, colour = "black" ))
Qualitative Variable
Below are plotted the frequency for all the qualitative variable. It’s also evaluated the Gini Index, where an Index near 1 means that the variable is evenly distributed across the classes; instead a Gini index equal to zero is related to a variable that has one dominant class.
viz_qual_func = function(x, col_label, y_cord,x_cord, h_cord, v_cord, size_val){
gini_string = paste("Gini Index = ", round(gini_index_func({{x}}) , 1))
title_string = paste("Frequency Distribution: ", col_label)
g1 = ggplot()+
geom_col(aes( x = {{x}}), stat = "count", fill = "#5A7FA8", col = "black")+
labs( title = title_string, x = col_label, y = "Count")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))+
annotate("label", label = gini_string,
x = x_cord, y = y_cord,
vjust = v_cord,
hjust = h_cord,
alpha = 0.3, size = size_val)
return(g1)
}
plot_fum = viz_qual_func(Fumatrici, "Smoker", Inf, Inf, 1.1, 1, 4)
plot_fum
plot_del = viz_qual_func(Tipo.parto, "Cesarean or Natural", Inf, -Inf, -0.1, 1, 4)
plot_del
plot_ho = viz_qual_func(Ospedale, "Hospital", Inf, Inf, 1, 1, 3.2)
plot_gen = viz_qual_func(Sesso, "Male or Female", Inf, Inf, 1.1, 1, 3.2)
plot_ho | plot_gen
The dataset is evenly distributed between the Hospitals and the newborn’s gender; also the mode of delivery is evenly distributed with a Gini index of 0.8 and a slight predominance of the Natural delivery. The First figure tells us that the mother are almost everytime non smoker in this dataset.
Let’s first visualize the frequency table observed between the three hospitals.
ggplot()+
geom_col(aes(x = Tipo.parto, fill = Ospedale), stat = "count", position = "dodge")+
labs(title = "Mode of delivery by hospital", x = "Mode of delivery", y = "Count")+
theme_classic()+theme(plot.title = element_text(hjust = 0.5))
To reject or not the null hypothesis a chi-squared test can be perform.
chi = chisq.test(table(Ospedale, Tipo.parto))
kable(paste("Chi-Squared test p-value = ", round(chi$p.value, 2) , ". So the null hypothesis is rejected and from the dataset the number of cesarean section is indipendent from the hospital"), col.names = "")
| Chi-Squared test p-value = 0.58 . So the null hypothesis is rejected and from the dataset the number of cesarean section is indipendent from the hospital |
WEIGHT:
The known average weight of the newborn’s is 3300g.
test_data =t.test(Peso, mu = 3300, conf.level = 0.95)
kable(paste("Taken the population mean for the newborn's weight as 3300g a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to ", round(test_data$p.value, 3), ". So the null hypothesis is rejected and the mean is not equal to the population mean."), col.names = "")
| Taken the population mean for the newborn’s weight as 3300g a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to 0.132 . So the null hypothesis is rejected and the mean is not equal to the population mean. |
kable(paste("However the population mean is within the confidence level of [", round(test_data$conf.int[1],2),round(test_data$conf.int[2], 2), "]. So the difference can be attributed to the data selection."), col.names = "")
| However the population mean is within the confidence level of [ 3263.58 3304.79 ]. So the difference can be attributed to the data selection. |
kable("Considering the confindence level analysis the newborn's weight of the dataset is significantly equal to the population mean.", col.names = "")
| Considering the confindence level analysis the newborn’s weight of the dataset is significantly equal to the population mean. |
LENGHT:
The known average weight of the newborn’s is 500mm.
test_data =t.test(Lunghezza, mu = 500, conf.level = 0.95)
kable(paste("Taken the population mean for the newborn's length as 500mm a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to ", round(test_data$p.value, 3), ". So the null hypothesis is rejected and the mean is not equal to the population mean."), col.names = "")
| Taken the population mean for the newborn’s length as 500mm a T-test is performed with a confidence interval of 95% considering both side for the t-test. The p-value is equal to 0 . So the null hypothesis is rejected and the mean is not equal to the population mean. |
kable(paste("The population mean is also outside the confidence interval of [", round(test_data$conf.int[1],2),round(test_data$conf.int[2], 2), "]. So the difference can not be attributed to the data selection."), col.names = "")
| The population mean is also outside the confidence interval of [ 493.66 495.73 ]. So the difference can not be attributed to the data selection. |
kable("The newborn's length of the dataset is significantly different to the population mean.", col.names = "")
| The newborn’s length of the dataset is significantly different to the population mean. |
To answer a t-test is performed, with a confidence level of 95% and considering both side of the distribution. The t-test is performed considering the various measurements in relation with the gender.
test_data = t.test(data = df, Peso ~ Sesso)
ggplot()+
geom_boxplot(aes(y = Peso, x = Sesso, fill = Sesso))+
labs(title = "Weight Distribution by gender", x = "Gender", y = "Weight")+
theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)),
x = 2.3, y = 1000,
alpha = 0.3)+
annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)),
x = 0.7, y = 1000,
alpha = 0.3)
kable(paste("p-value = ", round(test_data$p.value,3), ". So the null hypotesis is rejected and the weight mean between the two gender is statistically different."), col.names = "")
| p-value = 0 . So the null hypotesis is rejected and the weight mean between the two gender is statistically different. |
test_data = t.test(data = df, Lunghezza ~ Sesso)
ggplot()+
geom_boxplot(aes(y = Lunghezza, x = Sesso, fill = Sesso))+
labs(title = "Length Distribution by gender", x = "Gender", y = "Weight")+
theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)),
x = 2.3, y = 300,
alpha = 0.3)+
annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)),
x = 0.7, y = 300,
alpha = 0.3)
kable(paste("p-value = ", test_data$p.value, ". So the null hypotesis is rejected and the length mean between the two gender is statistically different."), col.names = "")
| p-value = 2.23232824860459e-21 . So the null hypotesis is rejected and the length mean between the two gender is statistically different. |
test_data = t.test(data = df, Cranio ~ Sesso)
ggplot()+
geom_boxplot(aes(y = Cranio, x = Sesso, fill = Sesso))+
labs(title = "Cranium size Distribution by gender", x = "Gender", y = "Weight")+
theme_classic()+theme(plot.title = element_text(hjust = 0.5))+
annotate("label", label = paste("Mean = ", round(test_data$estimate[2], 2)),
x = 2.3, y = 240,
alpha = 0.3)+
annotate("label", label = paste("Mean = ", round(test_data$estimate[1], 2)),
x = 0.7, y = 240,
alpha = 0.3)
kable(paste("p-value = ", test_data$p.value, ". So the null hypotesis is rejected and the Cranium diameter mean between the two gender is statistically different."), col.names = "")
| p-value = 1.41401856590502e-13 . So the null hypotesis is rejected and the Cranium diameter mean between the two gender is statistically different. |
model_1 = lm(Peso ~., data = df)
summary(model_1)
##
## Call:
## lm(formula = Peso ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1123.26 -181.53 -14.45 161.05 2611.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6735.7960 141.4790 -47.610 < 2e-16 ***
## Anni.madre 0.8018 1.1467 0.699 0.4845
## N.gravidanze 11.3812 4.6686 2.438 0.0148 *
## FumatriciSmoker -30.2741 27.5492 -1.099 0.2719
## Gestazione 32.5773 3.8208 8.526 < 2e-16 ***
## Lunghezza 10.2922 0.3009 34.207 < 2e-16 ***
## Cranio 10.4722 0.4263 24.567 < 2e-16 ***
## Tipo.partoNat 29.6335 12.0905 2.451 0.0143 *
## Ospedaleosp2 -11.0912 13.4471 -0.825 0.4096
## Ospedaleosp3 28.2495 13.5054 2.092 0.0366 *
## SessoM 77.5723 11.1865 6.934 5.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 668.7 on 10 and 2487 DF, p-value: < 2.2e-16
Non-Categorical Variable:
Categorical Variable:
For categorical variable the summary needs to be read slightly different. The coefficient means that taken as constant the other variable a change of class define a change of the output variable equal to the coefficient.
The starting point of the saturated model (model_1) is:
Correllation Matrix
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(df, upper.panel = panel.smooth, lower.panel = panel.cor)
Above there is a plot of the correlation Matrix where is possible to see the correlation between the quantitative variable.
As expected the three physical measurement are highly correlated, with length and cranium size that are related with a coefficient of 0.60 and are respectively correlated with the weight with a coefficient of 0.80 and 0.70. The correlation matrix confirm the relation between “Gestazione” and the physical measurements, with coefficient 0.59 with weight, 0.62 with length, 0.46 with Cranium size; this aligns with what it’s known about pregnancy. From the chart it’s also possible to see that the relation between the months of the pregnancy and the physical measurements starts with a linear behavior and has a sort of plateau reached a certain body size.
The high correlation between cranium size and length suggest that it can be interaction within the structure of the model, as well as length and “Gestazione”.
Searching the right model
modelcomparison = function(model_A, model_B){
dataanova = anova(model_A, model_B)
pvalue = dataanova$`Pr(>F)`[2]
aic_a = AIC(model_A)
bic_a = BIC(model_A)
aic_b = AIC(model_B)
bic_b = BIC(model_B)
summ_a = summary(model_A)
summ_b = summary(model_B)
r_a = round(summ_a$r.squared, 3)
r_b = round(summ_b$r.squared, 3)
mse_model_a = mean(model_A$residuals^2)
rmse_model_a = sqrt(mse_model_a)
mse_model_b = mean(model_B$residuals^2)
rmse_model_b = sqrt(mse_model_b)
model <- c("Starting Model", "New Model")
AIC <- c(aic_a, aic_b)
BIC <- c(bic_a, bic_b)
R <- c(r_a, r_b)
RMSE_value <- c(rmse_model_a, rmse_model_b)
anovatest <- c("", round(pvalue, 3))
tablecomparison <- data.frame(Models = model, AIC_value = AIC, BIC_value = BIC, R_squared = R, RMSE = RMSE_value, ANOVA_pvalue = anovatest)
return(tablecomparison)
}
model_2 = update(model_1, ~. - Ospedale)
table12 = modelcomparison(model_1, model_2)
kable(table12)
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Starting Model | 35145.57 | 35215.45 | 0.729 | 273.4174 | |
| New Model | 35150.75 | 35208.98 | 0.728 | 273.9203 | 0.01 |
The anova test suggest that without the hospital variable the model are statistically different to the first model, with a p-value around 0.01 that is lower than 0.05. From the table above the BIC coefficient for model one (saturated model) is slightly lower to the new model’s coefficient, on the other side the AIC coefficient tells the opposite story. This suggests, with the R\(^{2}\) value that is the same between the two model, that the two model explain the same amount of information about the newborn’s weight, but the AIC parameter prefer over-parametrize model and BIC prefer simpler model. In this case simpler model is better because can lead to identify better what really influence the weight considering that there should be no difference between hospitals.
model_3 = update(model_2, ~. - Anni.madre)
table23 = modelcomparison(model_2, model_3)
kable(table23)
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Starting Model | 35150.75 | 35208.98 | 0.728 | 273.9203 | |
| New Model | 35149.33 | 35201.73 | 0.728 | 273.9518 | 0.45 |
The test confirm the behavior suggested by the summary of the saturated model. - pvalue of the anova test above the threshold of 0.05, so the two model are statistically equal. - BIC and AIC coefficient are very close, so the information explained by the two model remain equal, with a slight improvement in the coefficients of the new model especially in the BIC coefficient.
model_4 = update(model_3, ~. - Cranio)
table34 = modelcomparison(model_3, model_4)
kable(table34)
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Starting Model | 35149.33 | 35201.73 | 0.728 | 273.9518 | |
| New Model | 35692.26 | 35738.84 | 0.661 | 305.5233 | 0 |
Without the Cranium size variable R\(^{2}\) goes from 0.73 to 0.66, so the percentage of information explained by the new model decrease significantly; this is confirmed by the pvalue of the anova test that is below the threshold. BIC and AIC coeffienct confirm that the starting model is better.
model_5 = update(model_3, ~. -Cranio - Lunghezza + Cranio * Lunghezza)
table35 = modelcomparison(model_3, model_5)
kable(table35)
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Starting Model | 35149.33 | 35201.73 | 0.728 | 273.9518 | |
| New Model | 35128.71 | 35186.95 | 0.730 | 272.7146 | 0 |
Introducing an interaction factor between Cranium size and body length the anova test tells that the two model are statistically different and the BIC/AIC coefficient suggest that the model with the interaction factor explain better the newborn’s weight, with a lower BIC coefficient. In this case also the AIC coefficient is better for the new model. Meanwhile the percentage of the phenomenon explained by the model remain stable.
Is very clear that the physical measurements is clearly one of the best way to predict the newborn’s weight. So it can be studied if the non linearity coefficient of the length and Cranio can make the model improve,
final_model = update(model_5, ~. + I(Lunghezza**2) + I(Cranio**2))
table5f = modelcomparison(model_5, final_model)
kable(table5f)
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Starting Model | 35128.71 | 35186.95 | 0.73 | 272.7146 | |
| New Model | 35042.83 | 35112.71 | 0.74 | 267.8520 | 0 |
The non linear coefficient of the newborn’s length improve the model with a general imporvement of R\(^{2}\), AIC, BIC coefficients and RMSE.
WINNING MODEL
kable(final_model$coefficients, col.names = "Coefficient values")
| Coefficient values | |
|---|---|
| (Intercept) | -770.1966995 |
| N.gravidanze | 14.5143328 |
| FumatriciSmoker | -23.5146476 |
| Gestazione | 41.1071786 |
| Tipo.partoNat | 28.4210553 |
| SessoM | 71.8929637 |
| Cranio | 4.2662118 |
| Lunghezza | -11.7036094 |
| I(Lunghezza^2) | 0.0438045 |
| I(Cranio^2) | 0.0524431 |
| Cranio:Lunghezza | -0.0599864 |
Multicollinearity
vif_table = vif(final_model, type = "predictor")
kable(vif_table, caption = "Multicollinearity coefficients")
| GVIF | Df | GVIF^(1/(2*Df)) | Interacts With | Other Predictors | |
|---|---|---|---|---|---|
| N.gravidanze | 1.028617 | 1 | 1.014208 | – | Fumatrici, Gestazione, Tipo.parto, Sesso, Cranio, Lunghezza |
| Fumatrici | 1.007964 | 1 | 1.003974 | – | N.gravidanze, Gestazione, Tipo.parto, Sesso, Cranio, Lunghezza |
| Gestazione | 1.853892 | 1 | 1.361577 | – | N.gravidanze, Fumatrici, Tipo.parto, Sesso, Cranio, Lunghezza |
| Tipo.parto | 1.004512 | 1 | 1.002253 | – | N.gravidanze, Fumatrici, Gestazione, Sesso, Cranio, Lunghezza |
| Sesso | 1.048700 | 1 | 1.024061 | – | N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Cranio, Lunghezza |
| Cranio | 1.917242 | 5 | 1.067254 | I(Cranio^2), Lunghezza | N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Sesso |
| Lunghezza | 1.917242 | 5 | 1.067254 | I(Lunghezza^2), Cranio | N.gravidanze, Fumatrici, Gestazione, Tipo.parto, Sesso |
The multicollinearity coefficients are under 5, so the model assumption on multicollinearity is respected.
General Overview
plot(final_model)
A. Residuals vs Fitted
The residuals plot is consistent with a random distribution around zero
B. Q-Q Plot
The points should be around the bisector, and the results have some problems at the extreme with the model that is not consistent at the extreme part, probably due to some outliers
C. Scale-Location
The points are distributied randomly around a constant between 1 and 0.5.
D. Residuals vs. Leverage
Here is possible to see that 3 values are outliers. Where 310 and 1429 are close to the 0.5 limit, so they shouldn’t be a problem for the model; but 1551 is outside of the threshold of 1, so can have an influence on the model.
Here the outliers:
kable(df[1549, ])
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1551 | 35 | 1 | Not a Smoker | 38 | 4370 | 315 | 374 | Nat | osp3 | F |
It seems a normal datapoint, where a non smoker mother was able to do a Natural delivery at the end of the pregnancy itself. The weight is above way bigger than the female average of 3161 and this can be the reason. Some pregnancy that lead to a newborn’s with an high weight exists and can also be due to different reason as well as diseases, so it’s correct to have it in the dataset even if can influence the model.
Residuals
ggplot()+
geom_density(aes(residuals(final_model)))+
labs(title = "Final Model: Density plot of the Residuals", y = "Density", x = "Residuals")+
theme_minimal()+theme(plot.title = element_text(hjust = 0.5))
shap = shapiro.test(residuals(final_model))
bp = bptest(final_model)
dw = dwtest(final_model)
Tests <- c("Shapiro Test", "Breusch-Pagan Test", "Durbin-Watson Test")
Pvalue <- c(round(shap$p.value, 3), round(bp$p.value, 3), round(dw$p.value, 3))
recap_dataframe <- data.frame(Test = Tests, pvalue = Pvalue)
kable(recap_dataframe)
| Test | pvalue |
|---|---|
| Shapiro Test | 0.000 |
| Breusch-Pagan Test | 0.000 |
| Durbin-Watson Test | 0.085 |
From the result above it’s easy to understand that there are some problems with the model.
Outliers
A brief study of the outliers can explain what is happening.
lev = hatvalues(final_model)
p = sum(lev)
n = nrow(df)
Threshold = 2* p/n
plot(lev, main = "Leverage values", ylab = "Leverage values", xlab = "Index", pch = 1)
abline(h = Threshold, col = "red", lwd = 1, lty = 2)
text(x = 2200,
y = 0.58,
labels = paste("Number of outliers =", length(lev[lev > Threshold])),
col = "black", font = 2, cex = 0.8)
legend(
"topleft",
legend = c("Leverage", "Threshold"),
col = c("black", "red"),
pch = c(1, NA),
lty = c(NA, 2),
lwd = c(NA, 1),
bty = "n")
From the leverage plot, can already be seen that there are more than 200 outliers.
plot(rstudent(final_model), main = "Residuals rinormalized", ylab = "Studentize residuals")
abline(h=c(-2,2), col = "red")
legend(
"topleft",
legend = c("Residuals", "Threshold"),
col = c("black", "red"),
pch = c(1, NA),
lwd = c(NA, 1),
bty = "n")
Also through the tstudent ri-normalization it clear that the residuals have outliers. So the outliers can be studied further.
detach(df)
outliers = which(lev < Threshold)
df["Leverage value"] = lev
ggplot(data = df)+
geom_point(aes(x = Fumatrici, y = `Leverage value`))+
labs( title = "Leverage values by Smoker habits", x = "Smoker", y = "Leverage Value")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
Smoking create almost every time an “out of standard” residuals of the newborn’s weight, enough to influence the regression model to be unpredictable.
df_outliers = df[-outliers, ]
ggplot(data = df_outliers)+
geom_col(aes(x = Fumatrici), stat = "Count", fill = "5A7FA8", col = "black")+
labs( title = "Count of data with Leverage value above threshold", x = "Smoking", y = "Count")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
notoutliers = which(lev > Threshold)
df_clean = df[-notoutliers,]
ggplot(data = df_clean)+
geom_col(aes(x = Fumatrici), stat = "Count", fill = "5A7FA8", col = "black")+
labs( title = "Count of data with Leverage value below threshold", x = "Smoking", y = "Count")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
This last two column plots confirm that a smoker is always out of standard, for this dataset at least, influencing the regression model.
Let’s see if a model with only non smoker datapoints reflect better the requirment for a regression model.
df_non_smoker <- df[df$Fumatrici == "Not a Smoker", ]
df_non_smoker$Fumatrici = NULL
saturated_model_nonsmok = lm(Peso ~. - `Leverage value`, data = df_non_smoker)
summary(saturated_model_nonsmok)
##
## Call:
## lm(formula = Peso ~ . - `Leverage value`, data = df_non_smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1113.79 -181.67 -14.27 162.97 2577.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6736.1069 143.3650 -46.986 < 2e-16 ***
## Anni.madre 0.7243 1.1739 0.617 0.53728
## N.gravidanze 12.5362 4.7702 2.628 0.00864 **
## Gestazione 34.5836 3.8885 8.894 < 2e-16 ***
## Lunghezza 10.1097 0.3062 33.018 < 2e-16 ***
## Cranio 10.4955 0.4340 24.183 < 2e-16 ***
## Tipo.partoNat 33.1134 12.3394 2.684 0.00733 **
## Ospedaleosp2 -10.0486 13.7665 -0.730 0.46551
## Ospedaleosp3 33.2505 13.8054 2.409 0.01609 *
## SessoM 79.7595 11.4224 6.983 3.74e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.3 on 2384 degrees of freedom
## Multiple R-squared: 0.7302, Adjusted R-squared: 0.7292
## F-statistic: 717 on 9 and 2384 DF, p-value: < 2.2e-16
The r-squared is similar to the previous one, with a value of 0.73.
To choose the final model for non smokers is taken in consideration:
attach(df_non_smoker)
modelcomparison_2 = function(model_A, model_B){
dataanova = anova(model_A, model_B)
pvalue = dataanova$`Pr(>F)`[2]
aic_a = AIC(model_A)
bic_a = BIC(model_A)
aic_b = AIC(model_B)
bic_b = BIC(model_B)
summ_a = summary(model_A)
summ_b = summary(model_B)
r_a = round(summ_a$r.squared, 3)
r_b = round(summ_b$r.squared, 3)
mse_model_a = mean(model_A$residuals^2)
rmse_model_a = sqrt(mse_model_a)
mse_model_b = mean(model_B$residuals^2)
rmse_model_b = sqrt(mse_model_b)
model <- c("Saturated Model", "Final Model")
AIC <- c(aic_a, aic_b)
BIC <- c(bic_a, bic_b)
R <- c(r_a, r_b)
RMSE_value <- c(rmse_model_a, rmse_model_b)
anovatest <- c("", round(pvalue, 3))
tablecomparison <- data.frame(Models = model, AIC_value = AIC, BIC_value = BIC, R_squared = R, RMSE = RMSE_value, ANOVA_pvalue = anovatest)
return(tablecomparison)
}
final_model_nosmok = update(saturated_model_nonsmok, ~. - Tipo.parto - Ospedale - Lunghezza - Cranio + Tipo.parto * Anni.madre + I(Lunghezza**2) + I(Cranio**2) )
kable(modelcomparison_2(saturated_model_nonsmok, final_model_nosmok))
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Saturated Model | 33687.07 | 33750.66 | 0.730 | 273.7448 | |
| Final Model | 33637.67 | 33695.48 | 0.736 | 271.0479 | NA |
The final model for the non smoker has a better BIC and AIC parameter and a slightly better R squared around 0.74.
With the same logic it’s possible to create a model for the mothers that smoke.
detach(df_non_smoker)
df_smoker <- df[df$Fumatrici == "Smoker", ]
df_smoker$Fumatrici = NULL
attach(df_smoker)
saturated_model_smok = lm(Peso ~. - `Leverage value`, data = df_smoker)
final_model_smok = update(saturated_model_smok, ~. - Tipo.parto - Ospedale - Lunghezza - Cranio + Tipo.parto * Anni.madre + I(Lunghezza**2) + I(Cranio**2) )
detach(df_smoker)
kable(modelcomparison_2(saturated_model_smok, final_model_smok))
| Models | AIC_value | BIC_value | R_squared | RMSE | ANOVA_pvalue |
|---|---|---|---|---|---|
| Saturated Model | 1453.224 | 1482.312 | 0.756 | 235.5555 | |
| Final Model | 1449.915 | 1476.359 | 0.759 | 234.0780 | NA |
shap_nonsmok = shapiro.test(residuals(final_model_nosmok))
bp_nonsmok = bptest(final_model_nosmok)
dw_nonsmok = dwtest(final_model_nosmok)
shap_smok = shapiro.test(residuals(final_model_smok))
bp_smok = bptest(final_model_smok)
dw_smok = dwtest(final_model_smok)
model <- c("Not smoker model", "Smoker Model")
Shapiro_Test <- c(round(shap_nonsmok$p.value, 3), round(shap_smok$p.value, 3))
BP_Test <- c(round(bp_nonsmok$p.value, 3), round(bp_smok$p.value, 3))
DW_Test <- c(round(dw_nonsmok$p.value, 3), round(dw_smok$p.value, 3))
tabletests <- data.frame(Models = model, Shap_test = Shapiro_Test, BP_test = BP_Test, DW_test = DW_Test)
kable(tabletests)
| Models | Shap_test | BP_test | DW_test |
|---|---|---|---|
| Not smoker model | 0.000 | 0.000 | 0.037 |
| Smoker Model | 0.361 | 0.062 | 0.788 |
Not Smoking Model:
plot(final_model_nosmok)
Smoking Model:
plot(final_model_smok)
col_to_keep = c("N.gravidanze", "Fumatrici", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")
new_data = df[, col_to_keep]
prevision = predict(final_model, newdata = new_data)
new_data["Peso"] = prevision
new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
With the training data the model seems stable and predict similar value to the actual model.
Now to test if the model is stable with a different dataset we can use the bootstrapping method. A new dataframe is create with samples of the row of the initial dataset so that the proportion and the underlying natural laws are conserved, but it’s still possible to see if the model can handle new dataset.
df_train = df[,col_to_keep]
new_data <- df[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]
prediction <- predict(final_model, newdata = new_data)
new_data["Peso"] = prediction
new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
The model is still stable even with the new dataset generated from the training data. With the same logic is possible to generate a new synthetic dataset that reflect the underlying natural laws, so that there isn’t possibility of impossible combination of, for example, Cranium size, lenght and newborn’s weight.
new_data <- mysynt$syn
prediction <- predict(final_model, newdata = new_data)
new_data["Peso"] = prediction
new_data["Category"] = "Prediction"
df["Category"] = "Actual"
df_bind = df[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
The model remain stable and accurate with the synthetic dataset too.
col_to_keep = c("N.gravidanze", "Anni.madre", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")
new_data = df_non_smoker[, col_to_keep]
prevision = predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prevision
new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
The model seems stable and predict similar value to the actual dataset
Bootstrapping method.
df_train = df_non_smoker[,col_to_keep]
new_data <- df_non_smoker[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]
prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction
new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
The model is still stable even with the new dataset generated from the training data. As before now it’s possible to study the model with a synthetic dataset.
new_data <- mysynt$syn
prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction
new_data["Category"] = "Prediction"
df_non_smoker["Category"] = "Actual"
df_bind = df_non_smoker[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
The model remain stable and accurate with the synthetic dataset too.
col_to_keep = c("N.gravidanze", "Anni.madre", "Gestazione", "Tipo.parto", "Sesso", "Cranio", "Lunghezza")
new_data = df_smoker[, col_to_keep]
prevision = predict(final_model_smok, newdata = new_data)
new_data["Peso"] = prevision
new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Original Dataset",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
df_train = df_smoker[,col_to_keep]
new_data <- df_smoker[sample(nrow(df_train), size = nrow(df_train) * 2, replace = TRUE), ]
prediction <- predict(final_model_nosmok, newdata = new_data)
new_data["Peso"] = prediction
new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Bootstrapping",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
new_data <- mysynt$syn
prediction <- predict(final_model_smok, newdata = new_data)
new_data["Peso"] = prediction
new_data["Category"] = "Prediction"
df_smoker["Category"] = "Actual"
df_bind = df_smoker[, colnames(new_data)]
df_totale <- rbind(df_bind, new_data)
ggplot(data = df_totale, aes(x = Peso, fill = Category)) +
geom_density(alpha = 0.4) +
theme_minimal()+
labs(title = "Distribution comparison: Actual vs Simulated - Synthetic Dataset",
x = "Weight", y = "Density")+
theme(plot.title = element_text(hjust = 0.5))
Similar to the previous model, the Smoker model, is stable with the three different prediction simulation.
Now it’s possible to understand how the different variable impact the linear regression model
4.1 Gender
ggplot(data = df)+
geom_point(aes(x = Lunghezza * Cranio,
y = Peso,
col = Sesso), size = 0.8, position = "jitter")+
geom_smooth(aes(x = Lunghezza * Cranio,
y = Peso,
col = Sesso), se = FALSE, method = "lm")+
theme_minimal()+
labs(title = "Newborn's Measurments: Linear Regression by Gender",
x = "Body Length * Cranium Size", y = "Weight", color = "Gender")+
theme(plot.title = element_text(hjust = 0.5))
The gender variable determine two different regression where the male ones have higher measurments, but with same slope, so same relation between the three variable, as expected.
ggplot(data = df)+
geom_point(aes(x = Gestazione,
y = Peso,
col = Sesso), size = 0.8, position = "jitter")+
geom_smooth(aes(x = Gestazione,
y = Peso,
col = Sesso), se = FALSE, method = "lm")+
theme_minimal()+
labs(title = "Number of pregnancy month: Linear Regression by Gender",
x = "# Months", y = "Weight", color = "Gender")+
theme(plot.title = element_text(hjust = 0.5))
The same behavior can be seen with the duration of the pregnancy.
Also for both charts is it possible to see that the male cluster is above the female one in general, as expected.
4.3 Mother’s Age
Another important factor for a good pregnancy is often the mother’s age.
df$MotAge <- cut(
df$Anni.madre,
breaks = c(0, 18, 30, max(df$Anni.madre)),
labels = c("Minors", "18-30", "30+"),
right = TRUE,
include.lowest = TRUE
)
ggplot(data = df)+
geom_point(aes(x = Lunghezza * Cranio,
y = Peso,
col = MotAge), size = 0.8, position = "jitter", alpha = 1/5)+
geom_smooth(aes(x = Lunghezza * Cranio,
y = Peso,
col = MotAge), se = FALSE, method = "lm", linewidth = 0.5)+
theme_minimal()+
labs(title = "Newborn's Measurments: Linear Regression by Mother's age",
x = "Body Length * Cranium Size", y = "Weight", color = "Age Range")+
theme(plot.title = element_text(hjust = 0.5))+xlim(135000,190000)+ylim(2000,4000)
Different age lead to different slope, for example:
ggplot(data = df)+
geom_point(aes(x = Gestazione,
y = Peso,
col = MotAge), size = 0.8, position = "jitter")+
geom_smooth(aes(x = Gestazione,
y = Peso,
col = MotAge), se = FALSE, method = "lm")+
theme_minimal()+
labs(title = "Number of pregnancy month: Linear Regression by Mother's age",
x = "# Months", y = "Weight", color = "Age range")+
theme(plot.title = element_text(hjust = 0.5))
Considering the number of months of the pregnancy the same behavior is highlighted. The most stable and reliable age range is between 18 and 30.
4.3 Smoking habits
g1 = ggplot(data = df)+
geom_point(aes(x = Lunghezza * Cranio,
y = Peso,
col = Fumatrici), size = 0.8, position = "jitter")+
geom_smooth(aes(x = Lunghezza * Cranio,
y = Peso,
col = Fumatrici), se = FALSE, method = "lm")+
theme_minimal()+ylim(3100, 4100)+xlim(150000, 190000)+labs(color = "Smoking Habits")+xlab(NULL)+ylab(NULL)
g2 = ggplot(data = df)+
geom_point(aes(x = Lunghezza * Cranio,
y = Peso,
col = Fumatrici), size = 0.8, position = "jitter")+
geom_smooth(aes(x = Lunghezza * Cranio,
y = Peso,
col = Fumatrici), se = FALSE, method = "lm")+
theme_minimal()+
labs(title = "Newborn's Measurments: Linear Regression by smoking habits",
x = "Body Length * Cranium Size", y = "Weight", color = "Smoking habits")+
theme(plot.title = element_text(hjust = -0.3), legend.position = "none",
axis.title.x = element_text(hjust = 4, vjust = -1))
g2 | g1
Instead of gender, smoking habits influence the relation between Cranium size and body length with the newborn’s weight. The slope is higher in the smoker cluster, and taken as normal the non smoker cluster, this means that with smaller changes the output can change faster, so this means more unpredictability.
ggplot(data = df)+
geom_point(aes(x = Gestazione,
y = Peso,
col = Fumatrici), size = 0.8, position = "jitter")+
geom_smooth(aes(x = Gestazione,
y = Peso,
col = Fumatrici), se = FALSE, method = "lm")+
theme_minimal()+
labs(title = "Number of pregnancy months: Linear Regression by smoking habits",
x = "# months", y = "Weight", color = "Smoking habits")+
theme(plot.title = element_text(hjust = 0.5))+xlim(30, max(df$Gestazione))
The image above highlight perfectly the impact of smoking. More month of pregnancy doesn’t lead to an increase body weight. So smoking can be the reason of underweight babies. It’s also clear from the chart, that the datapoint of the smoking cluster is at the edge of the non smoking group, with less point above the pink regression line and more under the line.
This project lead to some very important conclusion highlighting the impact of different variable to the prediction of the newborn’s baby. The most important output was understanding the real impact of smoking habits; smoking lead to an unpredictable pregnancy, so Neonatal Health Solutions should invest in preventing smoking during the pregnancy. Another important component is the age of the mother, at a very young age the body of the mother is not ready for a long pregnancy, so the prevention of unwanted pregnancies and a good care of the health of the body of the mother can helps pregnancy at a very young age; meanwhile at an older age (30+) the first months of the pregnancy seem to be the most dangerous, so the prevention and care should start at the very beginning of the pregnancy. Also, from the models, the body measurements of the newborn’s can be a very important indicator of the future weight of the baby, so Neonatal Health Solutions should invest in better equipment to measure body length and cranium size.