Analysis of Factors of Low Birth Weight

The data on 189 births were collected at Baystate Medical Center, Springfield, Mass., during 1986. The data set contains an indicator of low infant birth weight as the response and several risk factors associated with birth weight. It contains the following variables:

+ low: indicator of birth weight, either >= 2.5 kg (normal weight) or < 2.5 kg (underweight)
+ age: mother’s age in years
+ lwt: mother’s weight in pounds at last menstrual period
+ race: mother’s race (white, black or other)
+ smoke: smoking status during pregnancy (yes, no)
+ ht: history of hypertension (yes, no)
+ ui: presence of uterine irritability (yes, no)
+ ftv: number of physician visits during the first trimester (0-6)
+ ptl: number of previous premature labours (0-3)
+ bwt: birth weight in grams

We will first tidy our dataset by:

+ converting our ordinal and categorical variables into factors
+ converting lwt and bwt into the same unit of measurement (kg)
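The tidying code itself is not shown in this write-up. The two steps could look roughly like the base R sketch below. It assumes the raw data are the birthwt data frame shipped with the MASS package and that the tidied result is the `new` data frame used in the rest of the analysis; the factor labels are illustrative choices, not necessarily the ones used originally.

```r
# Assumption: the raw data are MASS::birthwt (189 rows); the original
# tidying step is not shown, so the labels below are illustrative.
raw <- MASS::birthwt

new <- within(raw, {
  # Step 1: categorical and ordinal variables become factors
  low   <- factor(low,   levels = c(0, 1), labels = c("normal", "underweight"))
  race  <- factor(race,  levels = 1:3,     labels = c("white", "black", "other"))
  smoke <- factor(smoke, levels = c(0, 1), labels = c("no", "yes"))
  ht    <- factor(ht,    levels = c(0, 1), labels = c("no", "yes"))
  ui    <- factor(ui,    levels = c(0, 1), labels = c("no", "yes"))
  ftv   <- factor(ftv)
  ptl   <- factor(ptl)
  # Step 2: both weights are expressed in kilograms
  lwt   <- lwt * 0.45359237   # pounds -> kg
  bwt   <- bwt / 1000         # grams  -> kg
})

str(new)
```

Skewness, kurtosis and rank-based tests do not depend on the unit of measurement, so the conversion only affects how the summaries read.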

library(SmartEDA)
library(ISLR)
library(dplyr)   # provides %>% and select(), used below
ExpCatViz(
  new %>%
    select(low,ftv),
  target="ftv"
)
## [[1]]

The graph above shows the effect of physician visits on birth weight. The bar plot displays proportions. Mothers who made 6 physician visits during the first trimester had babies of normal weight (>= 2.5 kg). This could suggest that the more physician visits a mother makes during the first trimester, the higher her chances of having a baby of normal weight and the lower her chances of having an underweight baby. However, we will check this speculation by conducting hypothesis tests.

Descriptive Statistics

Descriptive statistics are used to explore numeric variables, either as a whole or split by the levels of a categorical variable.


The table above gives us the summary statistics for the numeric variables: mother’s age (age), mother’s weight (lwt) and the birth weight of the child (bwt). We can also use tbl_summary() from the gtsummary package to get the descriptive statistics.
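The call that produced the summary table is not shown here. As a rough stand-in, the base R sketch below computes a comparable summary, overall and split by the birth weight indicator. It assumes the data are MASS::birthwt in their raw units, and describe_numeric is a hypothetical helper name introduced only for this illustration.

```r
# Assumption: the raw data are MASS::birthwt (lwt in pounds, bwt in grams).
bw <- MASS::birthwt

# Hypothetical helper: basic descriptive statistics for one numeric vector
describe_numeric <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
}

# One row per variable: mother's age, mother's weight, birth weight
desc <- t(sapply(bw[c("age", "lwt", "bwt")], describe_numeric))
round(desc, 2)

# Mean and standard deviation of birth weight within each level of low
by_low <- aggregate(bwt ~ low, data = bw,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))
by_low
```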


Categorical variables are summarized by counts and percentages, while numeric variables are summarized by mean and standard deviation. The by argument specifies the grouping variable, which in this case is the birth weight indicator. add_p() runs a statistical test for every variable and reports the p-values. Numeric variables are compared between the two groups with the non-parametric Wilcoxon rank-sum test. Categorical variables are checked with Fisher’s exact test if the expected count in any cell is below 5; otherwise Pearson’s Chi-squared test is used.
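To make the test selection concrete, here is a minimal base R sketch of the same logic. It assumes the data are MASS::birthwt and it only mimics the rule described above; it is not the code gtsummary runs internally.

```r
# Assumption: the raw data are MASS::birthwt.
bw <- MASS::birthwt

# Numeric variable across two groups: Wilcoxon rank-sum test.
# exact = FALSE uses the normal approximation, which avoids the
# warning that ties would otherwise trigger.
w <- wilcox.test(lwt ~ low, data = bw, exact = FALSE)

# Categorical variable: pick the test from the expected cell counts
tab      <- table(race = bw$race, low = bw$low)
expected <- suppressWarnings(chisq.test(tab)$expected)

cat_test <- if (any(expected < 5)) {
  fisher.test(tab)                  # small expected counts
} else {
  chisq.test(tab, correct = FALSE)  # Pearson's Chi-squared test
}

c(wilcoxon = w$p.value, categorical = cat_test$p.value)
```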

Explore Distribution with skewness and kurtosis test

Skewness

Histograms and density plots offer a glimpse of the distribution of continuous variables. Skewness measures the lack of symmetry of a variable. A distribution is symmetric if it looks the same to the left and right of its central point. The skewness of a perfectly symmetrical distribution is 0. Positive skewness indicates that the data are skewed to the right, while negative skewness indicates that they are skewed to the left.
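The sign convention can be checked on small toy vectors. The sketch below uses the moment-based definition of skewness, m3 / m2^(3/2), which should be the quantity the moments package reports; sample_skewness is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 1, 2, 10)

sample_skewness(symmetric)       # symmetric data: skewness of 0
sample_skewness(right_skewed)    # long right tail: positive skewness
sample_skewness(-right_skewed)   # mirrored data: the sign flips
```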

Birth weight (bwt) appears bell-shaped (symmetric), while mother’s age (age) and mother’s weight (lwt) are right-skewed.

Let’s test for skewness on the mother’s age:

library(moments)
agostino.test(new$age)
## 
##  D'Agostino skewness test
## 
## data:  new$age
## skew = 0.71644, z = 3.79630, p-value = 0.0001469
## alternative hypothesis: data have a skewness

The p-value (0.0001469) is well below 0.05, so we reject the null hypothesis that the data are not skewed. Mother’s age is significantly right-skewed (skew = 0.72) and is therefore not normally distributed.

Testing for skewness on the birth weight:

agostino.test(new$bwt)
## 
##  D'Agostino skewness test
## 
## data:  new$bwt
## skew = -0.20698, z = -1.19271, p-value = 0.233
## alternative hypothesis: data have a skewness

The p-value (0.233) is above 0.05, so we fail to reject the null hypothesis that the data are not skewed; birth weight is consistent with a symmetric, normal-like distribution. The negative skewness (-0.21) indicates that birth weight is slightly, but not significantly, skewed to the left.

Testing for skewness on the mother’s weight:

agostino.test(new$lwt)
## 
##  D'Agostino skewness test
## 
## data:  new$lwt
## skew = 1.3912, z = 6.3054, p-value = 2.874e-10
## alternative hypothesis: data have a skewness

The p-value (2.874e-10) is far below 0.05, so we reject the null hypothesis that the data are not skewed. Mother’s weight is significantly skewed and therefore not normally distributed. As with mother’s age, the skew is to the right.

Kurtosis

Kurtosis measures how heavy the tails of a distribution are, and therefore how prone it is to outliers. The kurtosis of a normal distribution is 3. Let’s test for kurtosis on our continuous variables:

anscombe.test(new$age)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  new$age
## kurt = 3.5684, z = 1.5884, p-value = 0.1122
## alternative hypothesis: kurtosis is not equal to 3
anscombe.test(new$bwt)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  new$bwt
## kurt = 2.888821, z = -0.097807, p-value = 0.9221
## alternative hypothesis: kurtosis is not equal to 3
anscombe.test(new$lwt)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  new$lwt
## kurt = 5.3108, z = 3.7726, p-value = 0.0001616
## alternative hypothesis: kurtosis is not equal to 3

For birth weight and mother’s age, the kurtosis does not differ significantly from 3, which is consistent with normally distributed data without pronounced outliers. In contrast, the kurtosis of mother’s weight differs significantly from 3 (p = 0.0001616), indicating that the data are not normally distributed and probably contain outliers.
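To see how outliers push kurtosis above 3, the sketch below computes the moment-based kurtosis, m4 / m2^2, on two toy vectors. sample_kurtosis is a hypothetical helper written for this illustration, not a function from the moments package.

```r
# Hypothetical helper: moment-based sample kurtosis, m4 / m2^2
sample_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# No tails at all: every value sits one unit from the mean
two_point  <- c(-1, 1, -1, 1)
# Mostly identical values plus two extreme observations
heavy_tail <- c(rep(0, 20), -10, 10)

sample_kurtosis(two_point)    # far below 3: lighter tails than a normal
sample_kurtosis(heavy_tail)   # far above 3: the two outliers dominate
```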

Normality

The normality of the distribution should be checked, because it helps us choose the correct statistical test. If the data are normally distributed, we ought to use parametric tests, for instance the t-test (for 2 groups) or ANOVA (for more than 2 groups). If the data are not normally distributed, we should use non-parametric tests such as Mann-Whitney or Kruskal-Wallis. To check for normality we can use Q-Q plots and the Shapiro-Wilk test.

plot_qq(new)

plot_qq(new,by="low")

The Q-Q plot can be interpreted in the following way: if the points lie close to the diagonal line, the data are probably normally distributed. But how close is close? We need a statistical test to be sure, and that’s where the Shapiro-Wilk test comes in.

# Shapiro-Wilk test for every numeric variable, rounded for display
normality(new) %>%
  mutate_if(is.numeric, ~ round(., 3)) %>%
  flextable()

We can conclude that the birth weight is not normally distributed.
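For readers without dlookr installed, the same kind of check can be sketched with base R’s shapiro.test(). The example assumes the data are MASS::birthwt and is only an approximation of what the table above reports.

```r
# Assumption: the raw data are MASS::birthwt.
bw <- MASS::birthwt

# Shapiro-Wilk statistic and p-value for each numeric variable
sw_all <- sapply(bw[c("age", "lwt", "bwt")], function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p_value = res$p.value)
})
round(t(sw_all), 3)

# The same test for birth weight within each level of low,
# mirroring the grouped Q-Q plot
sw_by_low <- tapply(bw$bwt, bw$low, function(x) shapiro.test(x)$p.value)
sw_by_low
```

A p-value below 0.05 is evidence against normality; a larger p-value means the test found no evidence against it, not that normality is proven.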

Compare Groups

Box plots help us explore a combination of numeric and categorical variables. They mainly show whether the distributions of the groups differ.

library(ggstatsplot)
ggbetweenstats(data=new, x= smoke ,y=bwt, type = "np")

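The p-value indicates that birth weight differs significantly between mothers who smoke and mothers who do not. The same can be said of birth weight between mothers with and without uterine irritability. Because type = "np" requests a non-parametric test, the comparison is rank-based rather than a comparison of means.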

ggbetweenstats(data = new, x = ui, y = bwt, type = "np")

# race has three levels, so type = "np" runs a Kruskal-Wallis test here
ggbetweenstats(data = new, x = race, y = bwt, type = "np")
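The non-parametric tests behind these plots can also be run directly in base R. This is a minimal sketch that assumes the data are MASS::birthwt; it leaves out the effect sizes and pairwise comparisons that the plots may add.

```r
# Assumption: the raw data are MASS::birthwt; labels are illustrative.
bw <- MASS::birthwt
bw$smoke <- factor(bw$smoke, levels = c(0, 1), labels = c("no", "yes"))
bw$race  <- factor(bw$race,  levels = 1:3, labels = c("white", "black", "other"))

# Two groups: Mann-Whitney (Wilcoxon rank-sum) test
mw <- wilcox.test(bwt ~ smoke, data = bw, exact = FALSE)

# More than two groups: Kruskal-Wallis test
kw <- kruskal.test(bwt ~ race, data = bw)

c(smoke = mw$p.value, race = kw$p.value)
```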

Explore Correlations

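To check the relationships between numeric variables, we can use dlookr’s plot_correlate() function. As the warning below notes, recent dlookr versions deprecate it in favour of plot.correlate().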

plot_correlate(new,method="kendall")
## Warning: 'plot_correlate' is deprecated.
## Use 'plot.correlate' instead.
## See help("Deprecated")

ggcorrmat(data=new)

Non-significant correlations are crossed out.
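Whether a single correlation is significant can be checked with base R’s cor.test(). The sketch below assumes the data are MASS::birthwt and uses Kendall’s tau, as in the first plot; it does not apply any multiple-comparison adjustment, so its p-values need not match the ones behind the matrix plot.

```r
# Assumption: the raw data are MASS::birthwt.
bw <- MASS::birthwt

# exact = FALSE uses the normal approximation, which copes with ties
ct_age <- cor.test(bw$age, bw$bwt, method = "kendall", exact = FALSE)
ct_lwt <- cor.test(bw$lwt, bw$bwt, method = "kendall", exact = FALSE)

cor_summary <- data.frame(
  pair    = c("age vs bwt", "lwt vs bwt"),
  tau     = unname(c(ct_age$estimate, ct_lwt$estimate)),
  p_value = c(ct_age$p.value, ct_lwt$p.value)
)
cor_summary$significant <- cor_summary$p_value < 0.05
cor_summary
```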

Next, we plot the correlation between mother’s age and birth weight:

ggscatterstats(
  data=new,
  x=age,
  y=bwt,
  type="np"
)

We also look at the correlation between mother’s weight and birth weight. As the descriptive summaries and the kurtosis test showed earlier, mother’s weight has outliers, so we apply a “robust” correlation to reduce their influence.

ggscatterstats(
  data=new,
  x=lwt,
  y=bwt,
  type="robust"
)

ggbarstats(data = new, x = low, y = ftv, label = "both")

The visualization above shows the relationship between the number of physician visits during the first trimester and the birth weight indicator.

Performing exploratory data analysis is a breeze once you know the right tools to use.

Thank you!