This is a live document, which will be updated as the project progresses. You can expect to find additional sections as the analyses are finalised and points of interest arise.
In its initial version, this document contains insights on some data issues related to AFS’ Stockport project.
For any questions or comments, please feel free to get in touch via email.
The following table contains information about the datasets that have been made available by Stockport Council for the project. The reference_id column is for internal use in this report.
reference_id | file_name | details |
---|---|---|
I | hv_output_final.csv | this dataset contains the ASQ and WellComm scores, separated by domain and time period |
II | hv_activities.csv | the consultations and activities (e.g. the developmental reviews, HENRY, babbling babies) which occurred during the time period |
III | hv_interventions.csv | the list of recommended interventions, including interventions such as Bookstart or Big Book of Ideas |
IV | hv_intervention_obs.csv | the interventions the children attended |
V | hv_person.csv | the demographic information including LSOA, gender and ethnicity |
VI | pp_output.csv | whether a child is flagged as pupil premium or not (identified from when the children in the dataset reach reception age) |
VII | cin_output.csv | whether a child has been flagged as a child in need |
VIII | eha_ouput.csv | whether a child has had an early help assessment or not |
IX | hv_referrals.csv | if a child has been referred to an additional service (e.g. a podiatrist) |
X | hv_services_output.csv | the service attended e.g. CSALT, Health Visiting, Children’s Physiotherapy |
XI | stockport_start_well_stats.csv | unused |
XII | tyt_output.csv | two year old take-up of the health review |
The above records span the period between October 2015 and March 2021. Further data might be made available.
We have carried out an analysis of missing information, complemented by a discussion with the domain experts. The table below presents an overview of the missingness patterns and their likely causes. Percentages indicate the amount of missing data per variable.
reference_id | missingness | likely_cause |
---|---|---|
I | more recent records are heavily missing on a number of variables | not available at the time of data extraction |
II | person_id 1.93% | unclear, but likely negligible |
III | complete data | |
IV | person_id 4.36%, gestation 95.73% | Stockport have only just started recording gestation, therefore it is mainly missing |
V | variables from Neonatal.Unit onwards are missing to various degrees | possibly similar to the above |
VI | PP 85.07%, Year 85.07% | pupil premium (PP) is missing for a high percentage of children, as they must have reached reception age to be recorded as pupil premium; therefore we only have the status for the oldest children, who have now reached reception |
VII | ref_end 17.1% | a lot of the referrals are ongoing, therefore they haven’t yet ended |
VIII | eha_episode_end 22.91% | right censoring |
IX | complete data | |
X | InboundCloseDate 84.58% | right censoring |
XI | complete data | |
XII | TwoYearTakeUp 85.06%, Year 85.06% | we are still working on this |
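As a minimal sketch, per-variable percentages of this kind can be computed along the following lines (assuming the extracts are read in as plain CSV files; hv_person.csv is used as an example):

```r
# Read one of the extracts and compute the percentage of
# missing values per variable, rounded to two decimal places
hv_person <- read.csv("hv_person.csv")
miss_pct <- round(colMeans(is.na(hv_person)) * 100, 2)

# Keep only the variables with at least one missing value
sort(miss_pct[miss_pct > 0], decreasing = TRUE)
```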
Below is an example visualisation of missingness in dataset V, which is another helpful tool to improve the overview of patterns. We also make use of more sophisticated visualisations that explicitly link pair- and group-wise missingness patterns and can be used to inform imputation efforts (how best to fill in gaps).
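As an illustration of how such plots can be produced, here is a sketch using the naniar package (assuming hv_person has been read in as above; other packages would work equally well):

```r
library(naniar)

# Heatmap-style overview of missingness per variable and record
vis_miss(hv_person, warn_large_data = FALSE)

# Upset plot linking group-wise missingness patterns, which can
# be used to inform imputation strategies
gg_miss_upset(hv_person)
```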
Each assessment can result in a black (fail), grey (uncertain) or white (pass) outcome, which is stored in the asm_state variable. Ideally, these are determined by contrasting the questionnaire score with a provided cut-off threshold, which is specific to the sub-domain considered and can vary according to other factors (see the next section for a discussion of this).
In practice, health visitors have some degree of discretion in awarding the state. In the interest of quantifying how often the professional's call differs from the simple score-versus-cut-off comparison, the dataset also contains a variable called asm_state_cal that holds the automatically determined state, based solely on scores and thresholds.
The table below highlights the existing discrepancies across all ASQ-3 questionnaires and measurement occasions (note: it might be interesting to stratify this analysis further by factors such as the type of questionnaire, sub-domain or age). Rows refer to the health-visitor-determined states, columns to the automated ones.
##
## black grey white
## black 5577 277 42
## grey 627 8782 294
## white 291 1851 119787
The same table, but with percentages:
##
## black grey white
## black 4.0552 0.2014 0.0305
## grey 0.4559 6.3856 0.2138
## white 0.2116 1.3459 87.1001
Overall, there seems to be good agreement, although some degree of discrepancy is observed, in particular around the cut-offs.
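For reference, a sketch of how the two tables above can be obtained (hv_output is an assumed name for dataset I, read into a data frame; asm_state and asm_state_cal are as described above):

```r
# Cross-tabulate the health-visitor-determined states against
# the automatically calculated ones
agreement <- table(hv_output$asm_state, hv_output$asm_state_cal)
agreement

# The same table, as percentages of all assessments
round(prop.table(agreement) * 100, 4)
```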
The focus here is on reviews, i.e., when any ASQ-3 questionnaire is administered again for the same time period. Domain knowledge suggests that:
- reviews are typically carried out when at least one of the sub-domain scores does not obtain a pass, but the health visitor feels the child is almost there;
- reviews are typically carried out ~3 months after the first assessment;
- not passing on one or more sub-domains should not warrant administering the whole questionnaire again.
In light of this, it is of interest to understand whether the above expectations on reviews find confirmation in the Stockport data. Specifically, we want to know about:
- the overall number of reviews;
- the time gap between assessment and review;
- whether differences are observable by:
Note: there seem to be discrepancies (114 records) if we identify duplicates via ParObTerm rather than asmt+period. I'll use the latter, for now.
Approximately 7.37% of unique identifiers (983 out of 13342) present at least one review. Of these, 50 have at least 2 reviews, and 3 have 3.
This results in a total of 6310 duplicate records out of 147268, or ~4.28% of the total.
The following table contains the counts of duplicate records by type of assessment (comm to prob refer to ASQ-3, SE stands for social-emotional).
##
## 0 1 2 3
## comm 26559 1148 45 0
## fine 26238 826 21 0
## gros 27274 1118 30 0
## pers 26201 978 36 0
## prob 26153 946 36 0
## SE 6457 250 15 0
## WellComm 2382 444 99 12
Or, in relative terms (row-wise conditional distributions):
##
## 0 1 2 3
## comm 95.70 4.14 0.16 0.00
## fine 96.87 3.05 0.08 0.00
## gros 95.96 3.93 0.11 0.00
## pers 96.27 3.59 0.13 0.00
## prob 96.38 3.49 0.13 0.00
## SE 96.06 3.72 0.22 0.00
## WellComm 81.10 15.12 3.37 0.41
Notably, the WellComm assessments present a higher proportion of duplicates compared with the others.
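A sketch of how reviews can be identified under the asmt+period definition used here (dplyr; hv_output and the exact column names are assumptions):

```r
library(dplyr)

# A review is a repeated administration of the same assessment
# (asmt) for the same time period, within the same child
review_counts <- hv_output %>%
  count(person_id, asmt, period, name = "n_records") %>%
  mutate(n_reviews = n_records - 1)

# Counts of assessments by type and number of reviews,
# mirroring the tables above
table(review_counts$asmt, review_counts$n_reviews)
```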
An overall view of the average ASQ-3 score over time (in days, between 2015-10-08 and 2021-03-30), by sub-domain (comm, fine, gros, pers, prob), is shown below, stratified by review status (individuals who never had a review vs. those with at least one).
The dashed blue line denotes the average threshold for a PASS over time (update on this below). The grey bands reflect the amount of information available at each time point to estimate the moving average. What this representation does not show is variation by stage of child development: at each time point, all test scores for that sub-domain around that date, for all children, are averaged. Some might be taking the 2-month questionnaire, others the 36-month one.
The plot above can ideally be used to gain insight into temporal trends in scores over the study window, by sub-domain and situation (never had a review, at least one review). For example, looking at those who did not need a review (in red): the evidence seems to suggest an overall constant average score on comm, pers and prob, while the average score for fine showed a decrease, and that for gros an increase, over time. Again, different stages of development are pooled together.
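For illustration, a plot of this kind could be sketched with ggplot2 as follows (score, date, sub_domain and reviewed are assumed column names, not confirmed against the extract):

```r
library(ggplot2)

ggplot(hv_output, aes(x = date, y = score, colour = reviewed)) +
  # LOESS moving average with pointwise confidence bands; the
  # grey bands widen where fewer observations are available
  geom_smooth(method = "loess") +
  facet_wrap(~ sub_domain) +
  labs(x = "Time (days)", y = "Average ASQ-3 score",
       colour = "Review status")
```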
Another potentially useful view of the data can be obtained by looking at time as defined by measurement occasions (at 2 months, 18 months, etc), rather than by calendar date. In this way, the measurements for the cohort can be represented over a timescale that has a common “time zero” for everyone, which coincides with their first measurement (usually at 2 months). This overcomes the problem with mixed developmental stages, but pools together questionnaires that are not contemporary (which is not a problem per se, but would need to be taken into account if interested in evaluating a population intervention at a given point in time). The plot below has the same structure as the previous one, with the only distinction being the time scale.
Once again, the thresholds do not appear to be constant, which needs to be investigated (possibly an artifact of how I am grouping?). The increasingly wide grey bands suggest less information is available regarding assessments for older children. I will not delve into the interpretation of these plots; they are shown simply to illustrate what could be done. The next graph ignores the stratification by review status, and could be used to provide an overall assessment of trends in development within the study window, as measured by ASQ-3, by sub-domain.
[Update]: it was confirmed that the thresholds are, in fact, not meant to be constant across sub-domains. From a recent mail exchange with the Stockport group:
“The cut-off point for all ASQs is an empirically derived score that indicates the point that a child’s developmental performance begins to appear suspect and a further assessment/referral may be required.
According to ASQ “ A standard cut-off point, or referral criterion, for each age interval has been determined statistically using data from 3000 questionnaires. The cut-off points were derived using a variety of best-fit measures and employed to obtain an ideal balance between over-referral and under-referral to maximise sensitivity and specificity”
So it is based on standard deviation of a large sample.”
This is useful, although the question of how the cut-offs are actually computed (“using a variety of best-fit measures… etc”) remains open, as does the question of how much they are meant to vary over time within the same sub-domain.
The existence of the reviewing process, as well as the fact that we are looking at a complex system over time, warrants at least a look into how regular the assessments are and how closely they adhere to the planned schedule. The following graph is a visualisation that I originally developed for research work on breast cancer screening programmes, and that I feel would be very useful here.
For each individual and sub-domain, I compute the time distance (here in months, though in principle any unit of time) between the first assessment within the category and all subsequent ones. In this way, I obtain a set of observations whose frequency distribution can then be visualised by means of, for example, a histogram. The plot below contains the distribution of time from first assessment in months, by sub-domain. Note that I am excluding the zeros, as they would represent the distance of the first assessment (usually at 2 months of age) from itself. The fact that there seems to be some frequency mass at 0 is simply a graphical artifact due to how binning is done.
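A sketch of the underlying computation (again with assumed column names; dates are taken to be of class Date, and months are approximated as 30.44 days):

```r
library(dplyr)
library(ggplot2)

time_from_first <- hv_output %>%
  group_by(person_id, sub_domain) %>%
  # months between each assessment and the first one in the category
  mutate(months_from_first = as.numeric(date - min(date)) / 30.44) %>%
  ungroup() %>%
  # exclude the distance of the first assessment from itself
  filter(months_from_first > 0)

ggplot(time_from_first, aes(x = months_from_first)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ sub_domain) +
  labs(x = "Time from first assessment (months)", y = "Count")
```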
This visualisation is aimed at investigating the overall timing of the “assessment machine”. The evidence of a cyclic nature in when most of the ASQ-3 questionnaires are administered is to be expected given the planned timetable, as is some variability around it (questionnaires cannot possibly be administered at exactly the same intervals of time to everybody). What this plot also suggests, as highlighted by the increasingly wide grey uncertainty bands in the score trends in the previous section, is that we have fewer and fewer measurements as we look at older children.
It is possible to “zoom in”, if desired, and look more closely at a specific sub-domain; the following plot, for example, refers to comm alone, and I have highlighted (with red, dashed vertical lines) the expected times of measurement.
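The overlay can be obtained by filtering the sketch above to a single sub-domain and adding vertical reference lines (the values in expected_months below are placeholders, to be replaced with the actual planned schedule):

```r
library(dplyr)
library(ggplot2)

# Placeholder values for the expected measurement times (months
# from first assessment); substitute the real timetable here
expected_months <- c(16, 25, 34)

time_from_first %>%
  filter(sub_domain == "comm") %>%
  ggplot(aes(x = months_from_first)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = expected_months,
             colour = "red", linetype = "dashed") +
  labs(x = "Time from first assessment (months)", y = "Count")
```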
This will be part of the analysis to be completed between February and March.