The first six rows of the raw data are presented in Table 1; these rows contain data from three households. Each household (i.e. family) is a single sample unit. There are 1996 households, of which 240 are single-parent households. However, the raw data are structured with a separate row for each parent, resulting in 3992 rows. There are 13 variables.
Table 1: First six rows of the raw data
| ID | single_parent | parent | age | income | smoke | quit | under_14 | over_14 | sex | birthweight | weight_year1 | weight_year3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FALSE | mother | 24 | 400 | 0 | NA | 1 | 0 | 0 | 2730 | 8512 | 13378 |
| 1 | FALSE | father | 25 | 400 | 0 | NA | 1 | 0 | 0 | 2730 | 8512 | 13378 |
| 2 | FALSE | mother | 31 | 800 | 0 | NA | 1 | 0 | 1 | 3669 | 11156 | 16757 |
| 2 | FALSE | father | 39 | 1200 | 0 | NA | 1 | 0 | 1 | 3669 | 11156 | 16757 |
| 3 | FALSE | mother | 32 | 400 | 0 | NA | 1 | 0 | 0 | 3438 | 10322 | 13683 |
| 3 | FALSE | father | 33 | 800 | 0 | NA | 1 | 0 | 0 | 3438 | 10322 | 13683 |
In this section, the raw data were cleaned and restructured to match the format shown in Table 2. This involved:
- age < 0: The age variable contained implausible values (e.g. -99). Rows with these values were removed.
- income as characters: Income values that were written out in words (e.g. “four hundred”) were replaced with their numerical equivalent (e.g. “four hundred” becomes “400”), and the income variable was then converted to numeric.
- smoke and quit for mothers: Some mothers recorded as never having smoked (smoke = 0) had quit = 1. These inconsistent rows were removed.
- sex = -1 values: Some rows had sex = -1. These invalid rows were removed.
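The cleaning rules above could be applied along the following lines. This is a minimal sketch rather than the code used for the analysis: the toy data frame, the `words_to_number` lookup, and the choice to drop a whole household when any of its rows fails a check are assumptions made for illustration.

```r
# Toy data standing in for the raw file; all values are invented.
raw <- data.frame(
  ID     = c(1, 1, 2, 2, 3, 3, 4, 4),
  parent = rep(c("mother", "father"), times = 4),
  age    = c(24, 25, -99, 39, 32, 33, 28, 30),
  income = c("400", "400", "800", "1200", "four hundred", "800", "400", "400"),
  smoke  = c(0, 0, 0, 0, 0, 0, 0, 1),
  quit   = c(NA, NA, NA, NA, NA, NA, 1, 0),
  sex    = c(0, 0, 1, 1, 0, 0, -1, -1),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from written amounts to numerals; extend as needed.
words_to_number <- c("four hundred" = "400", "eight hundred" = "800")

clean <- raw

# Replace written income values, then convert the column to numeric.
is_word <- clean$income %in% names(words_to_number)
clean$income[is_word] <- unname(words_to_number[clean$income[is_word]])
clean$income <- as.numeric(clean$income)

# Flag rows that fail a validity check.
# %in% treats NA as "no match", so missing values are never flagged.
bad_age   <- !is.na(clean$age) & clean$age < 0
bad_smoke <- clean$parent == "mother" & clean$smoke %in% c(0) & clean$quit %in% c(1)
bad_sex   <- clean$sex %in% c(-1)

# Assumption: a household is dropped if any of its rows is flagged.
bad_ids <- unique(clean$ID[bad_age | bad_smoke | bad_sex])
clean   <- clean[!(clean$ID %in% bad_ids), ]

print(clean)
```

In this toy example, household 2 is dropped for a negative age and household 4 for an inconsistent smoke/quit pair and an invalid sex code, leaving households 1 and 3.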
Table 2: First nine rows of the cleaned data
| ID | time | weight | single_parent | under_14 | over_14 | sex | age_mother | income_mother | income_father | smoke_mother | quit_mother |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | birthweight | 2730 | Single | 1 | 0 | Female | 24 | 400 | 400 | No | NA |
| 1 | weight_year1 | 8512 | Single | 1 | 0 | Female | 24 | 400 | 400 | No | NA |
| 1 | weight_year3 | 13378 | Single | 1 | 0 | Female | 24 | 400 | 400 | No | NA |
| 2 | birthweight | 3669 | Single | 1 | 0 | Male | 31 | 800 | 1200 | No | NA |
| 2 | weight_year1 | 11156 | Single | 1 | 0 | Male | 31 | 800 | 1200 | No | NA |
| 2 | weight_year3 | 16757 | Single | 1 | 0 | Male | 31 | 800 | 1200 | No | NA |
| 3 | birthweight | 3438 | Single | 1 | 0 | Female | 32 | 400 | 800 | No | NA |
| 3 | weight_year1 | 10322 | Single | 1 | 0 | Female | 32 | 400 | 800 | No | NA |
| 3 | weight_year3 | 13683 | Single | 1 | 0 | Female | 32 | 400 | 800 | No | NA |
The cleaned data are presented in Table 2. To obtain this structure, tidyr::pivot_wider() was first used to reshape the data frame so that there was a single row per household. Families with invalid data were then removed. Finally, the data frame was restructured into long format with tidyr::pivot_longer(), giving one row per child weight measurement, and the variables for the father’s age and smoking status were dropped.
The cleaned data have 5874 rows from 1958 family units (three weight measurements per family). Thus, a total of 38 families were removed because of invalid data.
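The two reshaping steps can be sketched as follows. The column names follow Table 1, but the toy data frame contains only a subset of the variables, and the exact arguments used in the original analysis are not stated, so the calls below are an assumed reconstruction.

```r
library(tidyr)

# Toy subset of the raw data: two households, one row per parent.
raw <- data.frame(
  ID           = c(1, 1, 2, 2),
  parent       = c("mother", "father", "mother", "father"),
  age          = c(24, 25, 31, 39),
  income       = c(400, 400, 800, 1200),
  sex          = c(0, 0, 1, 1),
  birthweight  = c(2730, 2730, 3669, 3669),
  weight_year1 = c(8512, 8512, 11156, 11156),
  weight_year3 = c(13378, 13378, 16757, 16757)
)

# Step 1: one row per household. Parent-specific columns become
# age_mother, age_father, income_mother, income_father.
wide <- pivot_wider(
  raw,
  names_from  = parent,
  values_from = c(age, income)
)

# Step 2: one row per weight measurement (long format).
long <- pivot_longer(
  wide,
  cols      = c(birthweight, weight_year1, weight_year3),
  names_to  = "time",
  values_to = "weight"
)

# Drop the father's age, as in Table 2.
long$age_father <- NULL

print(long)
```

Household-level columns such as `sex` and the weight variables are identical on both parent rows, which is what lets `pivot_wider()` collapse each household to a single row.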
In this section, three new variables were created: mother’s smoking status, equivalised household income (ehi), and income poverty (poverty).
Table 3: Frequency counts for mother’s smoking status
| Smoking status | Frequency |
|---|---|
| Never | 1618 |
| Former | 97 |
| Current | 218 |
| NA | 25 |
| Sum | 1958 |
Frequency counts for each category of the derived variable for mother’s smoking status are shown in Table 3. There were 315 (16.1%) mothers who reported having ever smoked tobacco; of these, 97 reported that they did not smoke during their pregnancy. The 25 (1.3%) mothers who responded “Prefer not to say” to this question were coded as NA.
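The percentages quoted with Table 3 follow directly from the frequency counts. A short check, using only the counts in the table (the vector name `counts` and the label `Missing` for the NA row are illustrative):

```r
# Frequency counts copied from Table 3.
counts <- c(Never = 1618, Former = 97, Current = 218, Missing = 25)

total       <- sum(counts)                                   # 1958
ever_smoked <- counts[["Former"]] + counts[["Current"]]      # 315

pct_ever    <- round(100 * ever_smoked / total, 1)           # 16.1
pct_missing <- round(100 * counts[["Missing"]] / total, 1)   # 1.3

print(c(total = total, ever_smoked = ever_smoked,
        pct_ever = pct_ever, pct_missing = pct_missing))
```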
Equivalised income is a common measure of household income that takes account of the differences in a household’s size and composition. A derived variable called ehi (Equivalised household income) was created using the following transformation:
\(\text{Equivalised household income} = \frac{\text{Total household income}}{\text{Household size factor}}\)
Here, total household income is the combined income of the mother and the father; in single-parent families, it is the mother’s income. The denominator, the household size factor, is the sum of:
- 1.0 for the first adult
- 0.5 for each additional adult and each child aged 14 or over
- 0.3 for each child aged under 14
For example, a household with two adults, one child aged 16, and one child aged 12 has a household size factor of:
\(1.0 + 2 \times 0.5 + 0.3 = 2.3\)
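The transformation can be sketched as a pair of small functions. The weights used here (1.0 for the first adult, 0.5 for each additional person aged 14 or over, 0.3 for each child under 14) are inferred from the worked example and are an assumption; the function and argument names are illustrative and not taken from the original analysis.

```r
# Household size factor under the assumed weights:
# 1.0 first adult, 0.5 each additional person aged 14+, 0.3 each child under 14.
household_size_factor <- function(n_adults, over_14, under_14) {
  1.0 + 0.5 * (n_adults - 1 + over_14) + 0.3 * under_14
}

# Equivalised household income. In single-parent families only the
# mother's income counts towards total household income.
equivalised_income <- function(income_mother, income_father, single_parent,
                               over_14, under_14) {
  n_adults <- ifelse(single_parent, 1, 2)
  total    <- ifelse(single_parent, income_mother, income_mother + income_father)
  total / household_size_factor(n_adults, over_14, under_14)
}

# Worked example from the text: two adults, one child aged 16, one aged 12.
print(household_size_factor(n_adults = 2, over_14 = 1, under_14 = 1))  # 2.3

# Illustrative household: incomes 800 and 1200, one child under 14.
print(equivalised_income(800, 1200, single_parent = FALSE,
                         over_14 = 0, under_14 = 1))
```

Because both functions are vectorised, they can be applied directly to whole columns of a household-level data frame.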
Income poverty is an indicator of low income and is intended to reflect a lack of money to meet basic living needs. It is often defined as a household income below a threshold known as a poverty line. In this analysis, a new variable called poverty was defined as an equivalised household income of less than 50% of the median equivalised household income.
For example, if the median equivalised household income was $500 per week, any equivalised household income below $250 per week would be considered to indicate income poverty.
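A minimal sketch of this definition is given below. The `ehi` values are invented so that the median is 500, matching the example in the text; missing incomes are assumed to propagate to a missing poverty indicator.

```r
# Invented equivalised household incomes (dollars per week); one is missing.
ehi <- c(120, 300, 500, 650, 900, NA)

# Poverty line: 50% of the median, ignoring missing values.
poverty_line <- 0.5 * median(ehi, na.rm = TRUE)   # 250

# 1 = below the poverty line, 0 = at or above it, NA = income unknown.
poverty <- as.integer(ehi < poverty_line)

print(poverty_line)
print(poverty)   # 1 0 0 0 0 NA
```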
Table 4: Summary statistics for equivalised household income (ehi) by income poverty
| Income poverty | N | Mean | Min | Max |
|---|---|---|---|---|
| 0 | 1381 | 481.6 | 190.5 | 2426.9 |
| 1 | 341 | 99.5 | 0.0 | 173.9 |
| NA | 236 | NA | NA | NA |
Summary statistics for the two derived variables ehi and poverty are presented in Table 4. There were 341 (17.4%) families with an equivalised household income less than half of the median equivalised household income (< $333 per week). These families were defined as living in income poverty at the time of birth of their youngest child.
There is concern that mothers who smoke during pregnancy are at risk of having children with a low birth weight. The aim of this section is to evaluate the extent to which child weight varies by maternal smoking status. In particular, we would like to know (a) whether the average birth weight of children whose mothers smoke during pregnancy is lower than the average birth weight of children whose mothers do not smoke during pregnancy, and (b) whether this difference persists to age 3.
This analysis reports simple descriptive statistics (mean and SD) and violin plots comparing the distribution of child weight at three time points by the smoking status of their mother. Child weight was recorded at birth, age 1, and age 3. Maternal smoking status was recorded at the time of birth.
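The descriptive statistics could be computed along the following lines. This is a sketch on invented data, not the analysis code: the data frame `children` and its values are hypothetical, and the use of ggplot2 for the violin plot is an assumption, since the plotting package is not named in the text.

```r
# Toy data; the weights are invented and are not taken from the study.
children <- data.frame(
  smoking_status = c("Never", "Never", "Never", "Current", "Current", "Current"),
  birthweight    = c(3200, 3300, 3100, 2900, 2800, 3000)
)

# Group-wise N, mean, and SD of birth weight.
by_group <- split(children$birthweight, children$smoking_status)
summary_table <- data.frame(
  smoking_status = names(by_group),
  N    = sapply(by_group, length),
  Mean = sapply(by_group, mean),
  SD   = sapply(by_group, sd),
  row.names = NULL
)
print(summary_table)

# Difference in mean birth weight: never smokers minus current smokers.
mean_diff <- mean(by_group[["Never"]]) - mean(by_group[["Current"]])
print(mean_diff)

# Optional violin plot, built only if ggplot2 (an assumed choice) is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  violin <- ggplot2::ggplot(
    children,
    ggplot2::aes(x = smoking_status, y = birthweight)
  ) +
    ggplot2::geom_violin()
}
```

The same pattern applies to weight at age 1 and age 3 by swapping the weight column.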
Table 5: Mean and SD of child weight (grams) at birth and at age 3 by maternal smoking status
| Smoking status | N | Mean birthweight | SD of birthweight | Mean weight at age 3 | SD of weight at age 3 |
|---|---|---|---|---|---|
| Never | 1618 | 3233 | 269 | 14308 | 1259 |
| Former | 97 | 3113 | 303 | 14524 | 1209 |
| Current | 218 | 2894 | 255 | 15631 | 1310 |
| NA | 25 | 3217 | 270 | 14254 | 1317 |
Figure 1: Child weight (grams) by maternal smoking status.
Interpretation: At birth, children of current smokers were lighter on average than children of never smokers (2894 g versus 3233 g, a mean difference of 339 g). Children of former smokers had an intermediate mean birth weight (3113 g), a pattern consistent with a smaller deficit when the mother did not smoke during pregnancy.
This difference was not maintained at age 3. By then, children of current smokers had a higher mean weight than children of never smokers (15631 g versus 14308 g, a difference of 1323 g). Variability was similar across groups at both time points (SD of 255 to 303 g at birth and 1209 to 1317 g at age 3). Because the direction of the difference reverses between birth and age 3, these descriptive results do not show a consistent effect of maternal smoking across time points.
Table 6: Summary statistics for mother’s age at birth
| N | Mean | SD | Minimum | Maximum |
|---|---|---|---|---|
| 1958 | 29.8 | 5.1 | 18 | 50 |
Figure 2: Comparison of observed mother’s age distribution with simulated distributions
Distribution Comparison: The observed distribution of mothers’ ages (mean = 29.8, SD = 5.1) is slightly right-skewed, with most mothers clustered between 25 and 35 years. Figure 2 shows the observed data in blue, the normal simulation in orange, and the uniform simulation in red.
Conclusion: Neither theoretical distribution perfectly matches the observed data. The normal distribution captures the central tendency but does not account for the right skew. The uniform distribution is clearly inappropriate. The observed distribution most closely resembles a normal distribution with a mild right skew: most mothers are concentrated around 30 years of age, with a longer tail of older mothers than of younger ones. For modeling purposes, a normal distribution may serve as a reasonable approximation, though a right-skewed distribution would better capture the observed pattern.
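Simulated comparison samples of this kind could be generated as sketched below. The parameters are assumptions taken from Table 6 (mean and SD for the normal, minimum and maximum for the uniform); the text does not state how the simulations were parameterised, and the seed and the `skewness` helper are illustrative.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible

n <- 1958     # number of mothers in the cleaned data

# Assumed parameters, taken from the summary statistics in Table 6.
sim_normal  <- rnorm(n, mean = 29.8, sd = 5.1)
sim_uniform <- runif(n, min = 18, max = 50)

# Moment-based sample skewness: near 0 for symmetric data,
# positive when the right tail is longer.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

print(skewness(sim_normal))
print(skewness(sim_uniform))
```

Applying the same `skewness` function to the observed ages would put a number on the right skew described above; a clearly positive value would support the visual impression from Figure 2.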