Data cleaning

Raw data

The first six rows of the raw data are presented in Table 1; these reflect data from three households. Each household is a single sampling unit and there are 1996 households (i.e. families), of which 240 are single-parent households. However, the raw data is structured with a separate row for each parent, resulting in 3992 rows of data. There are 13 variables.

In this section the raw data were cleaned and restructured to match the data in Table 2. This involved:

  • formatting variables to be consistent with the data dictionary,
  • identifying and correcting any data coding errors,
  • removing sampling units (families) with non-valid data, and
  • reshaping the dataframe.

Table 1: First three households in the raw data
ID single_parent parent age income smoke quit under_14 over_14 sex birthweight weight_year1 weight_year3
1 FALSE mother 24 400 0 NA 1 0 0 2730 8512 13378
1 FALSE father 25 400 0 NA 1 0 0 2730 8512 13378
2 FALSE mother 31 800 0 NA 1 0 1 3669 11156 16757
2 FALSE father 39 1200 0 NA 1 0 1 3669 11156 16757
3 FALSE mother 32 400 0 NA 1 0 0 3438 10322 13683
3 FALSE father 33 800 0 NA 1 0 0 3438 10322 13683

Data Cleaning and Corrections of Non-valid Values

Error 1: age < 0

The age variable contained implausible negative values (e.g. -99). Rows with a negative age were treated as non-valid and removed.

Error 2: income as characters

Some income values were recorded as words (e.g. “four hundred”). These were replaced with their numerical equivalents (e.g. “four hundred” became 400), and the income variable was then converted to numeric.
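The recoding described for Error 2 can be sketched as follows. This is a minimal Python illustration of the logic, not the code used for the report. The lookup table WORD_TO_NUMBER and the function parse_income are hypothetical names, and only the “four hundred” entry comes from the data described above.

```python
# Hypothetical lookup of written income values. Only "four hundred" is
# taken from the report; the other entries are illustrative assumptions.
WORD_TO_NUMBER = {
    "four hundred": 400,
    "eight hundred": 800,
    "twelve hundred": 1200,
}


def parse_income(value):
    """Return income as a float, or None when it cannot be interpreted."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])
    try:
        return float(text)
    except ValueError:
        # Leave unrecognised entries missing rather than guessing a value.
        return None


raw_income = ["400", "four hundred", "1200", "unknown"]
clean_income = [parse_income(v) for v in raw_income]
```

Leaving unrecognised entries as missing keeps the conversion from silently inventing incomes.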

Error 3: Inconsistent smoke and quit for mothers

Some mothers marked as “never smoked” (smoke = 0) had quit = 1. These inconsistent rows were removed.

Error 4: Impossible sex = -1 values

Some records had sex = -1, which is not a valid code. These inconsistent rows were removed.
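The three removal rules (Errors 1, 3 and 4) can be expressed as a single validity check. The sketch below is an illustration in Python under stated assumptions: the function names are hypothetical, rows are represented as dictionaries keyed by the column names in Table 1, and a whole family is dropped when any of its rows fails a check, in line with the removal of sampling units described earlier.

```python
def row_is_valid(row):
    """Apply the three validity rules described above to one raw row."""
    if row["age"] is not None and row["age"] < 0:
        return False  # Error 1: negative age such as -99
    if row["parent"] == "mother" and row["smoke"] == 0 and row["quit"] == 1:
        return False  # Error 3: never smoked but recorded as having quit
    if row["sex"] == -1:
        return False  # Error 4: impossible sex code
    return True


def drop_invalid_families(rows):
    """Remove every row of any family with at least one non-valid row."""
    bad_ids = {row["ID"] for row in rows if not row_is_valid(row)}
    return [row for row in rows if row["ID"] not in bad_ids]


# Family 1 uses values from Table 1; families 2 to 4 are made-up examples
# that each trigger one rule.
rows = [
    {"ID": 1, "parent": "mother", "age": 24, "smoke": 0, "quit": None, "sex": 0},
    {"ID": 1, "parent": "father", "age": 25, "smoke": 0, "quit": None, "sex": 0},
    {"ID": 2, "parent": "mother", "age": -99, "smoke": 0, "quit": None, "sex": 1},
    {"ID": 2, "parent": "father", "age": 39, "smoke": 0, "quit": None, "sex": 1},
    {"ID": 3, "parent": "mother", "age": 32, "smoke": 0, "quit": 1, "sex": 0},
    {"ID": 4, "parent": "mother", "age": 30, "smoke": 1, "quit": 1, "sex": -1},
]
kept = drop_invalid_families(rows)
```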

Cleaned data

Table 2: First three households in the cleaned data
ID time weight single_parent under_14 over_14 sex age_mother income_mother income_father smoke_mother quit_mother
1 birthweight 2730 Single 1 0 Female 24 400 400 No NA
1 weight_year1 8512 Single 1 0 Female 24 400 400 No NA
1 weight_year3 13378 Single 1 0 Female 24 400 400 No NA
2 birthweight 3669 Single 1 0 Male 31 800 1200 No NA
2 weight_year1 11156 Single 1 0 Male 31 800 1200 No NA
2 weight_year3 16757 Single 1 0 Male 31 800 1200 No NA
3 birthweight 3438 Single 1 0 Female 32 400 800 No NA
3 weight_year1 10322 Single 1 0 Female 32 400 800 No NA
3 weight_year3 13683 Single 1 0 Female 32 400 800 No NA

The cleaned data is presented in Table 2. To structure the data in this format, tidyr::pivot_wider() was first used to reshape the dataframe so that there was a single row per household. Families with non-valid data were then removed. Finally, the dataframe was restructured into long format with tidyr::pivot_longer(), and the variables reflecting father’s age and smoking status were removed.
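The two reshaping steps can be illustrated without any particular library. The report itself used tidyr, so the plain-Python sketch below is only an analogue of the wide-then-long logic; the function names to_wide and to_long are hypothetical, and only a subset of the columns is carried through. The input values are the first two households in Table 1.

```python
def to_wide(raw_rows):
    """Collapse one row per parent into one record per household."""
    households = {}
    for row in raw_rows:
        # Child-level columns are identical on both parent rows, so they
        # are taken from the first row seen for each household.
        hh = households.setdefault(row["ID"], {
            "ID": row["ID"],
            "birthweight": row["birthweight"],
            "weight_year1": row["weight_year1"],
            "weight_year3": row["weight_year3"],
        })
        # Parent-level columns get a suffix, e.g. income_mother.
        hh["income_" + row["parent"]] = row["income"]
    return list(households.values())


def to_long(wide_rows, time_columns=("birthweight", "weight_year1", "weight_year3")):
    """Stack the three weight columns into time/weight pairs."""
    long_rows = []
    for hh in wide_rows:
        shared = {k: v for k, v in hh.items() if k not in time_columns}
        for col in time_columns:
            long_rows.append({**shared, "time": col, "weight": hh[col]})
    return long_rows


raw = [
    {"ID": 1, "parent": "mother", "income": 400, "birthweight": 2730, "weight_year1": 8512, "weight_year3": 13378},
    {"ID": 1, "parent": "father", "income": 400, "birthweight": 2730, "weight_year1": 8512, "weight_year3": 13378},
    {"ID": 2, "parent": "mother", "income": 800, "birthweight": 3669, "weight_year1": 11156, "weight_year3": 16757},
    {"ID": 2, "parent": "father", "income": 1200, "birthweight": 3669, "weight_year1": 11156, "weight_year3": 16757},
]
wide = to_wide(raw)        # one record per household
long_rows = to_long(wide)  # three records per household
```

Each household contributes three long-format rows, which matches the ratio of 5874 rows to 1958 families in the cleaned data.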

The cleaned data has 5874 rows of data from 1958 family units. Thus, a total of 38 families were removed due to non-valid data.

New Variables

In this section three new variables were created: mother’s smoking status, equivalised household income (ehi), and income poverty (poverty).

Mother’s Smoking Status

Table 3: Mother’s recoded smoking status in the cleaned data
Smoking status Frequency
Never 1618
Former 97
Current 218
NA 25
Sum 1958

Frequency counts for each category of the derived variable reflecting mother’s smoking status are shown in Table 3. There were 315 (16.1%) mothers who reported having ever smoked tobacco; of these, 97 reported that they did not smoke during their pregnancy. The 25 (1.3%) mothers who responded “Prefer not to say” to this question were coded as NA.
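A possible form of the recoding is sketched below in Python. The report does not state how smoke and quit were coded beyond smoke = 0 meaning “never smoked”, so the rule here is an assumption: smoke = 1 with quit = 1 is treated as Former, smoke = 1 with quit = 0 as Current, and anything else as missing. The function name smoking_status is hypothetical. The counts are those in Table 3 and are used only to check the percentages quoted above.

```python
def smoking_status(smoke, quit):
    """Derive a three-level smoking status (assumed coding, see lead-in)."""
    if smoke == 0:
        return "Never"
    if smoke == 1 and quit == 1:
        return "Former"
    if smoke == 1 and quit == 0:
        return "Current"
    return None  # e.g. "Prefer not to say"


# Frequencies from Table 3.
counts = {"Never": 1618, "Former": 97, "Current": 218, None: 25}
total = sum(counts.values())
ever_smoked = counts["Former"] + counts["Current"]
pct_ever = round(100 * ever_smoked / total, 1)
pct_missing = round(100 * counts[None] / total, 1)
```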

Household income

Equivalised income is a common measure of household income that takes account of the differences in a household’s size and composition. A derived variable called ehi (Equivalised household income) was created using the following transformation:

\(\text{Equivalised household income} = \frac{\text{Total household income}}{\text{Household size factor}}\)

Here, total household income is the combined income of the mother and the father; in single-parent families, it is the mother’s income. The denominator, the household size factor, is the sum of:

  • 1.0 for the first adult
  • 0.5 for each additional person aged 14 and older
  • 0.3 for each person aged under 14

For example, a household with two adults, one child aged 16, and one child aged 12 has a household size factor of:
\(1.0 + 2 \times 0.5 + 0.3 = 2.3\)
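The transformation can be written as two small functions. This Python sketch is illustrative only: the function names are hypothetical, and it assumes that under_14 and over_14 in the raw data count the children in each age band and that a two-parent household contains exactly two adults.

```python
def household_size_factor(n_adults, n_over_14, n_under_14):
    """1.0 for the first adult, 0.5 per additional person aged 14 or
    older, and 0.3 per person aged under 14."""
    additional_14_plus = (n_adults - 1) + n_over_14
    return 1.0 + 0.5 * additional_14_plus + 0.3 * n_under_14


def equivalised_income(income_mother, income_father, single_parent,
                       n_over_14, n_under_14):
    """Total household income divided by the household size factor."""
    if single_parent:
        total_income = income_mother  # the mother's income only
        n_adults = 1
    else:
        total_income = income_mother + income_father
        n_adults = 2
    return total_income / household_size_factor(n_adults, n_over_14, n_under_14)


# Worked example from the text: two adults, one child aged 16, one aged 12.
example_factor = household_size_factor(n_adults=2, n_over_14=1, n_under_14=1)

# Household 1 in Table 1: both parents earn 400 and there is one child under 14.
ehi_household_1 = equivalised_income(400, 400, False, n_over_14=0, n_under_14=1)
```

Under these assumptions, household 1 has a size factor of 1.8, so its combined income of 800 gives an equivalised income of about 444 per week.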

Income poverty is an indicator of low income and is intended to reflect a lack of money to meet basic living needs. It is often defined as a household income below a threshold known as a poverty line. In this analysis, a new variable called poverty was defined as an equivalised household income less than 50% of the median equivalised household income.

For example, if the median equivalised household income was $500 per week, any income below $250 would be considered to indicate income poverty.
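The poverty indicator can be derived in a few lines. The Python sketch below is an illustration with made-up incomes chosen so that the median is $500, as in the example above; the function name poverty_flags is hypothetical, and missing incomes are assumed to be excluded from the median and given a missing poverty status.

```python
from statistics import median


def poverty_flags(ehi_values):
    """Return the poverty line and a 0/1/None flag for each household."""
    observed = [v for v in ehi_values if v is not None]
    poverty_line = 0.5 * median(observed)
    flags = [None if v is None else int(v < poverty_line) for v in ehi_values]
    return poverty_line, flags


# Illustrative equivalised incomes ($ per week), not values from the study.
ehi = [100.0, 240.0, 500.0, 650.0, 900.0, None]
poverty_line, flags = poverty_flags(ehi)
```

Here the median of the observed incomes is 500, so the poverty line is 250 and the first two households are flagged.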

Table 4: Summary statistics for equivalised household income ($ per week) by poverty status in the cleaned data.
Income poverty N Mean Min Max
0 1381 481.6 190.5 2426.9
1 341 99.5 0.0 173.9
NA 236 NA NA NA

Summary statistics for the two derived variables ehi and poverty are presented in Table 4. There were 341 (17.4%) families with an equivalised household income less than half of the median equivalised household income; the highest equivalised income in this group was $173.9 per week. These families were defined as living with income poverty at the time of birth of their youngest child. Poverty status was missing (NA) for a further 236 families.

Child weight by maternal smoking status

There is concern that mothers who smoke during pregnancy are at risk of having children with a low birth weight. The aim of this section is to evaluate the extent to which child weight varies by maternal smoking status. In particular, we would like to know (a) whether the average birth weight of children whose mothers smoked during pregnancy is lower than that of children whose mothers did not smoke during pregnancy, and (b) whether this difference is maintained at age 3.

This analysis reports simple descriptive statistics (mean and SD) and violin plots comparing the distribution of child weight at three time points by the smoking status of their mother. Child weight was recorded at birth, age 1, and age 3. Maternal smoking status was recorded at the time of birth.
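The grouped descriptive statistics can be computed as sketched below. This is a Python illustration rather than the code behind Table 5; the function name summarise_by_group is hypothetical. The three “Never” records use the birth weights of households 1 to 3 from Table 1, while the two “Current” records are made-up values included only to show a second group.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise_by_group(records, group_key, value_key):
    """Return N, mean and sample SD of value_key within each group."""
    groups = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            groups[rec[group_key]].append(rec[value_key])
    return {
        group: {
            "n": len(values),
            "mean": mean(values),
            # The sample SD needs at least two observations.
            "sd": stdev(values) if len(values) > 1 else None,
        }
        for group, values in groups.items()
    }


records = [
    {"smoking_status": "Never", "birthweight": 2730},    # household 1
    {"smoking_status": "Never", "birthweight": 3669},    # household 2
    {"smoking_status": "Never", "birthweight": 3438},    # household 3
    {"smoking_status": "Current", "birthweight": 2900},  # hypothetical
    {"smoking_status": "Current", "birthweight": 2800},  # hypothetical
]
summary = summarise_by_group(records, "smoking_status", "birthweight")
```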

Table 5: Child weight (grams) at birth and age 3 by maternal smoking status at birth.
Smoking status N Mean birthweight SD of birthweight Mean weight at age 3 SD of weight at age 3
Never 1618 3233 269 14308 1259
Former 97 3113 303 14524 1209
Current 218 2894 255 15631 1310
NA 25 3217 270 14254 1317

Figure 1: Child weight (grams) by maternal smoking status.

Interpretation: At birth, children of current smokers were lighter on average than children of never smokers (2894g versus 3233g, a mean difference of 339g). Children of former smokers had an intermediate mean birth weight (3113g), which may reflect a smaller effect among mothers who did not smoke during pregnancy.

The difference was not maintained at age 3. In Table 5, children of current smokers had the highest mean weight at age 3 (15631g), 1323g above that of children of never smokers (14308g). Variability was similar across groups, with a slightly larger SD among children of current smokers (1310g) than among children of never smokers (1259g). The birth-weight deficit among children of current smokers therefore did not persist to age 3 in these data.

The distribution of mother’s age

Table 6: Summary statistics for mother’s age at birth

N Mean SD Minimum Maximum
1958 29.8 5.1 18 50

Figure 2: Comparison of observed mother’s age distribution with simulated distributions

Distribution comparison: The observed distribution of mothers’ ages (mean = 29.8, SD = 5.1) is slightly right-skewed, with most mothers clustered between 25 and 35 years. Figure 2 shows the observed data in blue, the normal simulation in orange, and the uniform simulation in red.

Conclusion: Neither theoretical distribution perfectly matches the observed data. The normal distribution captures the central tendency but does not account for the right skew. The uniform distribution is clearly inappropriate. The observed distribution most closely resembles a normal distribution with a mild right skew: most mothers are concentrated around 30 years, but there is a longer tail of older mothers than of younger ones. For modelling purposes, a normal distribution might serve as a reasonable approximation, though a right-skewed distribution would better capture the observed pattern.
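A comparison of this kind can be reproduced by simulation. The report does not state how the simulated distributions were parameterised, so the Python sketch below makes an assumption: both simulations are matched to the observed mean (29.8) and SD (5.1) from Table 6, with 1958 draws each. The seed and the skewness helper are illustrative choices.

```python
import random
from statistics import mean, stdev


def skewness(values):
    """Simple moment-based skewness: about 0 for symmetric data,
    positive when the right tail is longer."""
    m = mean(values)
    s = stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


rng = random.Random(1)           # fixed seed so the sketch is repeatable
n, mu, sd = 1958, 29.8, 5.1      # N, mean and SD from Table 6

# Normal simulation with the observed mean and SD.
normal_sim = [rng.gauss(mu, sd) for _ in range(n)]

# A uniform distribution on [a, b] has SD (b - a) / sqrt(12), so a half-width
# of sd * sqrt(3) around the mean gives the same mean and SD.
half_width = sd * 3 ** 0.5
uniform_sim = [rng.uniform(mu - half_width, mu + half_width) for _ in range(n)]

normal_skew = skewness(normal_sim)
uniform_skew = skewness(uniform_sim)
```

Both simulated samples should have skewness close to zero, so a clearly positive skewness in the observed ages would be consistent with the right skew described above.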