The data on 189 births were collected at Baystate Medical Center, Springfield, Massachusetts, during 1986. The data set contains an indicator of low infant birth weight as a response and several risk factors associated with birth weight. It contains the following variables:

+ low: indicator of birth weight, either >=2.5 kg (normal weight) or <2.5 kg (underweight)
+ age: mother’s age in years
+ lwt: mother’s weight in pounds at last menstrual period
+ race: mother’s race (white, black or other)
+ smoke: smoking status during pregnancy (yes, no)
+ ht: history of hypertension (yes, no)
+ ui: presence of uterine irritability (yes, no)
+ ftv: number of physician visits during the first trimester (0-6)
+ ptl: number of previous premature labours (0-3)
+ bwt: birth weight in grams
We will first tidy our dataset by:

+ converting our ordinal and categorical variables into factors
+ converting lwt and bwt into the same unit of measurement (kg)
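A minimal sketch of these two tidying steps is shown below. It assumes the data come from `MASS::birthwt`, since the loading code is not shown here; the copy used in this post also carries an `id` column and may differ slightly. The data frame name `new` matches the rest of the post, while the factor labels and the pounds-to-kilograms divisor are illustrative choices.

```r
# Assumption: MASS::birthwt stands in for the data file used in the post.
data(birthwt, package = "MASS")

new <- within(birthwt, {
  # Binary and categorical codes become labelled factors.
  low   <- factor(low, levels = c(0, 1), labels = c(">=2.5 kg", "<2.5 kg"))
  race  <- factor(race, levels = 1:3, labels = c("white", "black", "other"))
  smoke <- factor(smoke, levels = c(0, 1), labels = c("no", "yes"))
  ht    <- factor(ht, levels = c(0, 1), labels = c("no", "yes"))
  ui    <- factor(ui, levels = c(0, 1), labels = c("no", "yes"))
  # Ordinal counts are treated as factors too.
  ftv <- factor(ftv)
  ptl <- factor(ptl)
  # Put both weights on the same scale (kilograms).
  lwt <- lwt / 2.20462  # pounds -> kilograms
  bwt <- bwt / 1000     # grams  -> kilograms
})

str(new)
```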
library(SmartEDA)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ISLR)
library(dplyr) # provides %>% and select() used below
ExpCatViz(
new %>%
select(low,ftv),
target="ftv"
)
## [[1]]
The graph above shows how the birth weight indicator varies with the
number of physician visits. The bar plot shows proportions within each
visit count. The one mother who made 6 physician visits during the
first trimester had a normal-weight baby (>=2.5kg). This could suggest
that the more physician visits during the first trimester, the higher
the chances of a normal-weight baby and the lower the chances of an
underweight one. However, a single observation proves little, so we
will check this speculation with hypothesis tests.
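Because the plot shows proportions only, it helps to look at the counts behind them. The sketch below builds the same cross-tabulation in base R, assuming `MASS::birthwt` as a stand-in for the data used in this post.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# Rows: number of first-trimester visits; columns: low birth weight flag (0/1).
counts <- table(ftv = birthwt$ftv, low = birthwt$low)

# Proportions within each visit count (each row sums to 1).
props <- prop.table(counts, margin = 1)

print(counts)
print(round(props, 2))
```

The higher visit counts contain only a handful of mothers, which is why their proportions should be read with caution.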
Descriptive statistics are used to explore numeric variables, either as a whole or split by the groups of a categorical variable.
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
##
## transform
variables | min | Q1 | mean | median | Q3 | max | zero | minus | outlier |
id | 4.00000 | 68.00000 | 121.079365 | 123.00000 | 176.00000 | 226.0000 | 0 | 0 | 0 |
age | 14.00000 | 19.00000 | 23.238095 | 23.00000 | 26.00000 | 45.0000 | 0 | 0 | 1 |
lwt | 36.28743 | 49.89522 | 58.885480 | 54.88474 | 63.50301 | 113.3982 | 0 | 0 | 14 |
bwt | 0.70900 | 2.41400 | 2.944286 | 2.97700 | 3.47500 | 4.9900 | 0 | 0 | 1 |
The table above gives us the summary statistics for the numeric variables: mother’s age (age), mother’s weight (lwt) and the birth weight of the child (bwt). We can also use tbl_summary from the gtsummary package to get the descriptive statistics.
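For readers who want to see where such numbers come from, here is a base-R sketch that computes a similar summary. `summarise_numeric` is a hypothetical helper written for this illustration, the outlier count uses the usual boxplot rule (an assumption about how the table above counts outliers), and `MASS::birthwt` again stands in for the post’s data, so weights are in the original pounds and grams.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# Summarise one numeric vector; outliers follow the boxplot (1.5 * IQR) rule.
summarise_numeric <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(min = min(x), Q1 = q[1], mean = mean(x), median = q[2], Q3 = q[3],
    max = max(x), outlier = length(boxplot.stats(x)$out))
}

num_vars <- birthwt[, c("age", "lwt", "bwt")]
round(t(sapply(num_vars, summarise_numeric)), 3)
```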
##
## Attaching package: 'gtsummary'
## The following objects are masked from 'package:flextable':
##
## as_flextable, continuous_summary
Characteristic | Birth weight >=2.5 kg, N = 130¹ | Birth weight <2.5 kg, N = 59¹ | p-value² |
Mother's age | 23.7 (5.6) | 22.3 (4.5) | 0.2 |
Age Category | | | 0.2 |
<19 | 36 (28%) | 15 (25%) | |
>29 | 23 (18%) | 4 (6.8%) | |
20-25 | 53 (41%) | 31 (53%) | |
25-29 | 18 (14%) | 9 (15%) | |
Mother's weight at last menstrual period | 60 (14) | 55 (12) | 0.013 |
Mother's race | | | 0.082 |
white | 73 (56%) | 23 (39%) | |
black | 15 (12%) | 11 (19%) | |
other | 42 (32%) | 25 (42%) | |
Smoking status | | | 0.026 |
no | 86 (66%) | 29 (49%) | |
yes | 44 (34%) | 30 (51%) | |
Number of previous premature labours | | | <0.001 |
0 | 118 (91%) | 41 (69%) | |
1 | 8 (6.2%) | 16 (27%) | |
2 | 3 (2.3%) | 2 (3.4%) | |
3 | 1 (0.8%) | 0 (0%) | |
Hypertension | | | 0.052 |
no | 125 (96%) | 52 (88%) | |
yes | 5 (3.8%) | 7 (12%) | |
Uterine irritability | | | 0.020 |
no | 116 (89%) | 45 (76%) | |
yes | 14 (11%) | 14 (24%) | |
No. of physician visits at 1st trimester | | | 0.3 |
0 | 64 (49%) | 36 (61%) | |
1 | 36 (28%) | 11 (19%) | |
2 | 23 (18%) | 7 (12%) | |
3 | 3 (2.3%) | 4 (6.8%) | |
4 | 3 (2.3%) | 1 (1.7%) | |
6 | 1 (0.8%) | 0 (0%) | |
Birth weight | 3.33 (0.48) | 2.10 (0.39) | <0.001 |
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson's Chi-squared test; Fisher's exact test
Categorical variables are summarized by counts and percentages, and numeric variables by mean and standard deviation. The by argument specifies the grouping variable, which in this case is the birth weight indicator. add_p() runs a statistical test for every variable and reports the p-values. Numeric variables are compared between the two groups with the Wilcoxon rank sum test, which is non-parametric. Categorical variables are compared with Fisher’s exact test when any expected cell count is below 5, and with Pearson’s Chi-squared test otherwise.
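The test-selection logic described above can be sketched in base R. This is only an illustration of the rule, not gtsummary’s own code: `compare_categorical` is a hypothetical helper, and `MASS::birthwt` is assumed as a stand-in for the post’s data.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# Numeric variable, two groups: Wilcoxon rank sum test.
wilcox.test(lwt ~ low, data = birthwt, exact = FALSE)

# Categorical variable: choose the test from the expected cell counts.
# `compare_categorical` is a hypothetical helper, not a gtsummary function.
compare_categorical <- function(x, group) {
  tab <- table(x, group)
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  if (any(expected < 5)) {
    list(test = "Fisher's exact test", p.value = fisher.test(tab)$p.value)
  } else {
    list(test = "Pearson's Chi-squared test",
         p.value = chisq.test(tab, correct = FALSE)$p.value)
  }
}

smoke_res <- compare_categorical(birthwt$smoke, birthwt$low)  # well-filled cells
ptl_res   <- compare_categorical(birthwt$ptl, birthwt$low)    # sparse cells
smoke_res
ptl_res
```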
Histograms and density plots offer us a glimpse of the distribution of continuous variables. Skewness measures the lack of symmetry of a variable. A distribution is symmetric if it looks the same to the left and right of its central point. The skewness of a perfectly symmetrical distribution is 0. Positive skewness indicates that the data are skewed to the right (a longer right tail), while negative skewness indicates that they are skewed to the left.
Birth weight (bwt) appears bell-shaped (symmetric), while mother’s age (age)
and mother’s weight (lwt) are right-skewed.
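To make the definition concrete, here is a small sketch of the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 3/2). `sample_skewness` is a hypothetical helper for illustration; it should agree with the definition used by the moments package, but treat that as an assumption.

```r
# Moment-based sample skewness: m3 / m2^(3/2), using central moments.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

sample_skewness(c(1, 2, 3, 4, 5))    # symmetric data
sample_skewness(c(1, 1, 2, 2, 10))   # long right tail
sample_skewness(c(1, 9, 9, 10, 10))  # long left tail
```

The first call returns 0, the second a positive value and the third a negative value, matching the left/right interpretation above.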
Let’s test for skewness on the mother’s age
library(moments)
##
## Attaching package: 'moments'
## The following objects are masked from 'package:dlookr':
##
## kurtosis, skewness
agostino.test(new$age)
##
## D'Agostino skewness test
##
## data: new$age
## skew = 0.71644, z = 3.79630, p-value = 0.0001469
## alternative hypothesis: data have a skewness
The p-value (0.0001) is well below 0.05, so we reject the null hypothesis that the data are not skewed. Mother’s age is significantly right-skewed (skew = 0.72) and therefore not normally distributed.
Testing for skewness on the Birth weight
agostino.test(new$bwt)
##
## D'Agostino skewness test
##
## data: new$bwt
## skew = -0.20698, z = -1.19271, p-value = 0.233
## alternative hypothesis: data have a skewness
The p-value (0.233) is above 0.05, so we fail to reject the null hypothesis that the data are not skewed. Birth weight is therefore consistent with a symmetric distribution; the negative skewness (-0.21) indicates only a slight, non-significant skew to the left.
Testing for skewness on the mother’s weight
agostino.test(new$lwt)
##
## D'Agostino skewness test
##
## data: new$lwt
## skew = 1.3912, z = 6.3054, p-value = 2.874e-10
## alternative hypothesis: data have a skewness
The p-value is far below 0.05, so we reject the null hypothesis that the data are not skewed. Mother’s weight is significantly skewed (skew = 1.39) and therefore not normally distributed. Like the mother’s age, it is right-skewed.
Kurtosis measures how heavy the tails of a distribution are, and so how prone it is to outliers. The kurtosis of a normal distribution is 3. Let’s test for kurtosis on our continuous variables.
anscombe.test(new$age)
##
## Anscombe-Glynn kurtosis test
##
## data: new$age
## kurt = 3.5684, z = 1.5884, p-value = 0.1122
## alternative hypothesis: kurtosis is not equal to 3
anscombe.test(new$bwt)
##
## Anscombe-Glynn kurtosis test
##
## data: new$bwt
## kurt = 2.888821, z = -0.097807, p-value = 0.9221
## alternative hypothesis: kurtosis is not equal to 3
anscombe.test(new$lwt)
##
## Anscombe-Glynn kurtosis test
##
## data: new$lwt
## kurt = 5.3108, z = 3.7726, p-value = 0.0001616
## alternative hypothesis: kurtosis is not equal to 3
For birth weight and the mother’s age, the kurtosis values do not differ significantly from 3 (p = 0.92 and p = 0.11), so their tails are consistent with a normal distribution and give no sign of heavy outliers. In contrast, the kurtosis for the mother’s weight (5.31) is significantly greater than 3 (p < 0.001), which points to heavy tails and probable outliers, and argues against normality.
The normality of the distribution should be checked, because it helps us choose the correct statistical test. If the data are normally distributed, we can use parametric tests, for instance a t-test (2 groups) or ANOVA (more than 2 groups). If the data are not normally distributed, we should use non-parametric tests such as Mann-Whitney or Kruskal-Wallis. To check for normality we can use QQ plots and the Shapiro-Wilk test.
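As with skewness, the kurtosis statistic can be written in a couple of lines: the fourth central moment divided by the squared second central moment. `sample_kurtosis` is a hypothetical helper for illustration only.

```r
# Moment-based sample kurtosis: m4 / m2^2, using central moments.
sample_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Evenly spread values have light tails, so kurtosis is well below 3.
flat_kurt <- sample_kurtosis(c(1, 2, 3, 4, 5))

# A large normal sample should land close to 3.
set.seed(1)
normal_kurt <- sample_kurtosis(rnorm(1e5))

c(flat = flat_kurt, normal = normal_kurt)
```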
plot_qq(new)
plot_qq(new,by="low")
The QQ plot can be interpreted in the following way: if the points are
situated close to the diagonal line, the data are probably normally
distributed. But how close is close? We need a statistical test just to
be sure, and that’s where the Shapiro-Wilk test comes in.
normality(new)%>%
mutate_if(is.numeric, ~round(.,3))%>%
flextable()
vars | statistic | p_value | sample |
id | 0.956 | 0.000 | 189 |
age | 0.960 | 0.000 | 189 |
lwt | 0.893 | 0.000 | 189 |
bwt | 0.993 | 0.443 | 189 |
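We can conclude that birth weight is consistent with a normal distribution (p = 0.443), while mother’s age and mother’s weight are not normally distributed (p < 0.001). The id column is only an identifier, so its result can be ignored.
Box plots help us to explore a combination of numeric and categorical variables. They mainly show whether the distribution differs between groups.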
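The rule of thumb “check normality, then pick the test” can be sketched in base R as follows. Both helpers are hypothetical, the 0.05 cut-off is a convention rather than a requirement, and `MASS::birthwt` is assumed as a stand-in for the post’s data.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# Hypothetical helper: TRUE when Shapiro-Wilk does not reject normality.
looks_normal <- function(x, alpha = 0.05) {
  shapiro.test(x)$p.value > alpha
}

# Hypothetical helper: t-test for normal-looking data, rank sum test otherwise.
compare_two_groups <- function(y, group) {
  if (looks_normal(y)) {
    t.test(y ~ group)
  } else {
    wilcox.test(y ~ group, exact = FALSE)
  }
}

bwt_res <- compare_two_groups(birthwt$bwt, birthwt$smoke)  # birth weight
lwt_res <- compare_two_groups(birthwt$lwt, birthwt$smoke)  # mother's weight
bwt_res$method
lwt_res$method
```

A stricter version would check normality within each group rather than on the pooled variable.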
library(ggstatsplot)
## Registered S3 method overwritten by 'parameters':
## method from
## format.parameters_distribution datawizard
## You can cite this package as:
## Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
## Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
ggbetweenstats(data=new, x= smoke ,y=bwt, type = "np")
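The p-value indicates that birth weight differs significantly between
mothers who smoke and mothers who don’t. Because type = "np" requests a
non-parametric test, the comparison is based on ranks rather than means.
The same can be said of birth weight between mothers with and without
uterine irritability.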
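The same two comparisons can be reproduced without any plotting package. The sketch below runs the rank sum (Mann-Whitney) test in base R, assuming `MASS::birthwt` as a stand-in for the post’s data; there `smoke` and `ui` are coded 0/1 and birth weight is in grams.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# Rank sum tests of birth weight by smoking status and uterine irritability.
smoke_test <- wilcox.test(bwt ~ smoke, data = birthwt, exact = FALSE)
ui_test    <- wilcox.test(bwt ~ ui, data = birthwt, exact = FALSE)

# Group medians give the direction of each difference.
smoke_medians <- tapply(birthwt$bwt, birthwt$smoke, median)
ui_medians    <- tapply(birthwt$bwt, birthwt$ui, median)

c(smoke = smoke_test$p.value, ui = ui_test$p.value)
smoke_medians
ui_medians
```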
library(ggstatsplot)
ggbetweenstats(data=new, x= ui ,y=bwt, type = "np")
library(ggstatsplot)
ggbetweenstats(data=new, x= race ,y=bwt, type = "np")
To check the relationship between numerical variables we can plot their correlations with the plot_correlate() function.
plot_correlate(new,method="kendall")
## Warning: 'plot_correlate' is deprecated.
## Use 'plot.correlate' instead.
## See help("Deprecated")
ggcorrmat(data=new)
Non-significant correlations are crossed out.
Next we plot the correlation between the mother’s age and birth weight.
ggscatterstats(
data=new,
x=age,
y=bwt,
type="np"
)
## Registered S3 method overwritten by 'ggside':
## method from
## +.gg GGally
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
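Rank-based correlations like the ones plotted above are also available in base R through `cor.test()`. The sketch below computes Spearman’s rho and Kendall’s tau for age and birth weight, assuming `MASS::birthwt` as a stand-in for the post’s data; which coefficient a given plotting function reports is worth checking in its documentation.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# exact = FALSE avoids warnings about ties in the data.
spearman <- cor.test(birthwt$age, birthwt$bwt, method = "spearman", exact = FALSE)
kendall  <- cor.test(birthwt$age, birthwt$bwt, method = "kendall", exact = FALSE)

c(rho = unname(spearman$estimate), tau = unname(kendall$estimate))
c(rho_p = spearman$p.value, tau_p = kendall$p.value)
```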
Next we look at the correlation between mother’s weight and birth
weight. As the descriptive summaries and the kurtosis test showed
earlier, the mother’s weight has outliers, so we apply a “robust”
correlation to decrease their influence.
ggscatterstats(
data=new,
x=lwt,
y=bwt,
type="robust"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
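One common way to make a correlation robust is to winsorize both variables, that is, to pull extreme values in to chosen quantiles before computing Pearson’s r. The sketch below illustrates the idea in base R. `winsorize` and its 20% trimming level are assumptions made for illustration, not necessarily what the plot above uses, and `MASS::birthwt` stands in for the post’s data.

```r
# Assumption: MASS::birthwt stands in for the post's data.
data(birthwt, package = "MASS")

# Hypothetical helper: clamp values to the `trim` and `1 - trim` quantiles.
winsorize <- function(x, trim = 0.2) {
  lims <- quantile(x, probs = c(trim, 1 - trim), names = FALSE)
  pmin(pmax(x, lims[1]), lims[2])
}

# Ordinary Pearson correlation versus the winsorized version.
plain_r  <- cor(birthwt$lwt, birthwt$bwt)
robust_r <- cor(winsorize(birthwt$lwt), winsorize(birthwt$bwt))

c(pearson = plain_r, winsorized = robust_r)
```

Comparing the two coefficients shows how much the heaviest mothers influence the ordinary correlation.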
library(ggstatsplot)
ggbarstats(data=new, x=low, y=ftv,label="both")
The visualization above shows the relationship between the number of
physician visits in the first trimester and the birth weight indicator.
Performing exploratory data analysis is a breeze once you know the right tools to use.
Thank you!