This is a live document, which will be updated as the project progresses. You can expect to find additional sections as the analyses are finalised and points of interest arise.
In its initial version, this document contains insights on some data issues related to AFS’ Stockport project.
For any questions or comments, please feel free to get in touch via email.
The following table contains information about the datasets that have been made available by Stockport Council for the project. The reference_id column is for internal use in this report.
reference_id | file_name | details |
---|---|---|
I | hv_output_final.csv | this dataset contains the ASQ and WellComm scores, separated by domain and time period |
II | hv_activities.csv | the consultations and activities (e.g. the developmental reviews, HENRY, babbling babies) which occurred during the time period |
III | hv_interventions.csv | the list of recommended interventions, including interventions such as Bookstart or Big Book of Ideas |
IV | hv_intervention_obs.csv | the interventions the children attended |
V | hv_person.csv | the demographic information including LSOA, gender and ethnicity |
VI | pp_output.csv | whether a child is flagged as pupil premium or not (identified from when the children in the dataset reach reception age) |
VII | cin_output.csv | whether a child has been flagged as a child in need |
VIII | eha_ouput.csv | whether a child has had an early help assessment or not |
IX | hv_referrals.csv | if a child has been referred to an additional service (e.g. a podiatrist) |
X | hv_services_output.csv | the service attended e.g. CSALT, Health Visiting, Children’s Physiotherapy |
XI | stockport_start_well_stats.csv | unused |
XII | tyt_output.csv | two year old take-up of the health review |
The above records span the period between October 2015 and March 2021. Further data might be made available.
We have carried out an analysis of missing information, complemented by a discussion with the domain experts. The table below presents an overview of the missingness patterns and their likely causes. Percentages indicate the amount of missing data per variable.
reference_id | missingness | likely_cause |
---|---|---|
I | more recent records are heavily missing on a number of variables | not available at the time of data extraction |
II | person_id 1.93% | unclear, but likely negligible |
III | complete data | |
IV | person_id 4.36%, gestation 95.73% | Stockport have only just started recording gestation, therefore it is mainly missing |
V | variables from Neonatal.Unit onwards are missing to various degrees | possibly similar to the above |
VI | PP 85.07%, Year 85.07% | pupil premium (PP) is missing for a high percentage of children, as they must have reached reception age to be recorded as pupil premium; therefore we only have the status for the oldest children, who have now reached reception |
VII | ref_end 17.1% | a lot of the referrals are ongoing, therefore they haven’t yet ended |
VIII | eha_episode_end 22.91% | right censoring |
IX | complete data | |
X | InboundCloseDate 84.58% | right censoring |
XI | complete data | |
XII | TwoYearTakeUp 85.06%, Year 85.06% | we are still working on this |
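As a minimal sketch, per-variable percentages of this kind can be computed along the following lines (assuming the extracts are read in as plain CSV files; hv_person.csv is used as an example):

```r
# Read one of the extracts and compute the percentage of
# missing values per variable, rounded to two decimal places
hv_person <- read.csv("hv_person.csv")
miss_pct <- round(colMeans(is.na(hv_person)) * 100, 2)

# Keep only the variables with at least one missing value
sort(miss_pct[miss_pct > 0], decreasing = TRUE)
```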
Below is an example visualisation of missingness in dataset V, which is another helpful tool to improve the overview of patterns. We also make use of more sophisticated visualisations that explicitly link pair- and group-wise missingness patterns and can be used to inform imputation efforts (how best to fill in gaps).
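As an illustration of how such plots can be produced, here is a sketch using the naniar package (assuming hv_person has been read in as above; other packages would work equally well):

```r
library(naniar)

# Heatmap-style overview of missingness per variable and record
vis_miss(hv_person, warn_large_data = FALSE)

# Upset plot linking group-wise missingness patterns, which can
# be used to inform imputation strategies
gg_miss_upset(hv_person)
```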
Each assessment can result in a black (fail), grey (uncertain) or white (pass) outcome, which is stored in the asm_state variable. Ideally, these are determined by contrasting the questionnaire score with a provided cut-off threshold, which is specific to the sub-domain considered and can vary according to other factors (see the next section for a discussion of this).
In practice, health visitors have some degree of discretion in awarding the state. In the interest of quantifying how often the professional's call differs from the simple score-versus-cut-off comparison, the dataset also contains a variable called asm_state_cal that holds the automatically determined state, based solely on scores and thresholds.
The table below highlights the existing discrepancies across all ASQ-3 questionnaires and measurement occasions (note: it might be interesting to stratify this analysis further by factors such as the type of questionnaire, sub-domain or age). Rows refer to the health-visitor-determined states, columns to the automated ones.
##
## black grey white
## black 5577 277 42
## grey 627 8782 294
## white 291 1851 119787
The same table, but with percentages:
##
## black grey white
## black 4.0552 0.2014 0.0305
## grey 0.4559 6.3856 0.2138
## white 0.2116 1.3459 87.1001
Overall, there seems to be good agreement, although some degree of discrepancy is observed, in particular around the cut-offs.
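For reference, a sketch of how the two tables above can be obtained (hv_output is an assumed name for dataset I, read into a data frame; asm_state and asm_state_cal are as described above):

```r
# Cross-tabulate the health-visitor-determined states against
# the automatically calculated ones
agreement <- table(hv_output$asm_state, hv_output$asm_state_cal)
agreement

# The same table, as percentages of all assessments
round(prop.table(agreement) * 100, 4)
```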
The focus here is on reviews, i.e., when any ASQ-3 questionnaire is administered again for the same time period. Domain knowledge suggests that:
- reviews are typically carried out when at least one of the sub-domain scores does not obtain a pass, but the health visitor feels the child is almost there;
- reviews are typically carried out ~3 months after the first assessment;
- not passing on one or more sub-domains should not warrant administering the whole questionnaire again.
In light of this, it is of interest to understand whether the above expectations on reviews find confirmation in the Stockport data. Specifically, we want to know about:
- the overall number of reviews;
- the time gap between assessment and review;
- whether differences are observable by:
Note: there seem to be discrepancies (114 records) if we identify duplicates via ParObTerm rather than asmt+period. I'll use the latter, for now.
Approximately 7.37% of unique identifiers (983 out of 13342) present at least one review. Of these, 50 have at least 2 reviews, and 3 have 3.
This results in a total of 6310 duplicate records out of 147268, or ~4.28% of the total.
The following table contains the counts of duplicate records by type of assessment (comm to prob refer to ASQ-3, SE stands for social-emotional).
##
## 0 1 2 3
## comm 26559 1148 45 0
## fine 26238 826 21 0
## gros 27274 1118 30 0
## pers 26201 978 36 0
## prob 26153 946 36 0
## SE 6457 250 15 0
## WellComm 2382 444 99 12
Or, in relative terms (row-wise conditional distributions):
##
## 0 1 2 3
## comm 95.70 4.14 0.16 0.00
## fine 96.87 3.05 0.08 0.00
## gros 95.96 3.93 0.11 0.00
## pers 96.27 3.59 0.13 0.00
## prob 96.38 3.49 0.13 0.00
## SE 96.06 3.72 0.22 0.00
## WellComm 81.10 15.12 3.37 0.41
Notably, the WellComm assessments present a higher proportion of duplicates compared with the others.
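A sketch of how reviews can be identified under the asmt+period definition used here (dplyr; hv_output and the exact column names are assumptions):

```r
library(dplyr)

# A review is a repeated administration of the same assessment
# (asmt) for the same time period, within the same child
review_counts <- hv_output %>%
  count(person_id, asmt, period, name = "n_records") %>%
  mutate(n_reviews = n_records - 1)

# Counts of assessments by type and number of reviews,
# mirroring the tables above
table(review_counts$asmt, review_counts$n_reviews)
```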
An overall view of the average ASQ-3 score over time (in days, between 2015-10-08 and 2021-03-30), by sub-domain (comm, fine, gros, pers, prob), is shown below, stratified by review status (individuals who never had a review vs. those with at least one).
The dashed blue line denotes the average threshold for a PASS over time (update on this below). The grey bands reflect the amount of information available at each time point to estimate the moving average. What this representation does not show is variation by stage of child development: at each time point, all test scores for that sub-domain around that date, for all children, are averaged. Some might be taking the 2-month questionnaire, others the 36-month one.
The plot above can ideally be used to gain insight into temporal trends in scores over the study window, by sub-domain and situation (never had a review, at least one review). For example, looking at those who did not need a review (in red): the evidence seems to suggest an overall constant average score on comm, pers and prob, while the average score for fine showed a decrease, and that for gros an increase, over time. Again, different stages of development are pooled together.
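For illustration, a plot of this kind could be sketched with ggplot2 as follows (score, date, sub_domain and reviewed are assumed column names, not confirmed against the extract):

```r
library(ggplot2)

ggplot(hv_output, aes(x = date, y = score, colour = reviewed)) +
  # LOESS moving average with pointwise confidence bands; the
  # grey bands widen where fewer observations are available
  geom_smooth(method = "loess") +
  facet_wrap(~ sub_domain) +
  labs(x = "Time (days)", y = "Average ASQ-3 score",
       colour = "Review status")
```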
Another potentially useful view of the data can be obtained by looking at time as defined by measurement occasions (at 2 months, 18 months, etc), rather than by calendar date. In this way, the measurements for the cohort can be represented over a timescale that has a common “time zero” for everyone, which coincides with their first measurement (usually at 2 months). This overcomes the problem with mixed developmental stages, but pools together questionnaires that are not contemporary (which is not a problem per se, but would need to be taken into account if interested in evaluating a population intervention at a given point in time). The plot below has the same structure as the previous one, with the only distinction being the time scale.
Once again, the thresholds do not appear to be constant, which needs to be investigated (possibly an artifact of how I am grouping?). The increasingly wide grey bands suggest less information is available regarding assessments for older children. I will not delve into the interpretation of these plots; they are shown simply to illustrate what could be done. The next graph ignores the stratification by review status, and could be used to provide an overall assessment of trends in development within the study window, as measured by ASQ-3, by sub-domain.
[Update]: it was confirmed that the thresholds are, in fact, not meant to be constant across sub-domains. From a recent mail exchange with the Stockport group:
“The cut-off point for all ASQs is an empirically derived score that indicates the point that a child’s developmental performance begins to appear suspect and a further assessment/referral may be required.
According to ASQ “ A standard cut-off point, or referral criterion, for each age interval has been determined statistically using data from 3000 questionnaires. The cut-off points were derived using a variety of best-fit measures and employed to obtain an ideal balance between over-referral and under-referral to maximise sensitivity and specificity”
So it is based on standard deviation of a large sample.”
This is useful, although the question of how the cut-offs are actually computed (“using a variety of best-fit measures… etc”) remains open, as does the question of how much they are meant to vary over time within the same sub-domain.
The existence of the reviewing process, as well as the fact that we are looking at a complex system over time, warrants at least a look into how regular the assessments are and how closely they adhere to the planned schedule. The following graph is a visualisation that I originally developed for research work on breast cancer screening programmes, and that I feel would be very useful here.
For each individual and sub-domain, I compute the time distance (here in months, though in principle any unit of time) between the first assessment within the category and all subsequent ones. In this way, I obtain a set of observations whose frequency distribution can then be visualised by means of, for example, a histogram. The plot below contains the distribution of time from first assessment in months, by sub-domain. Note that I am excluding the zeros, as they would represent the distance of the first assessment (usually at 2 months of age) from itself. The fact that there seems to be some frequency mass at 0 is simply a graphical artifact due to how binning is done.
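A sketch of the underlying computation (again with assumed column names; dates are taken to be of class Date, and months are approximated as 30.44 days):

```r
library(dplyr)
library(ggplot2)

time_from_first <- hv_output %>%
  group_by(person_id, sub_domain) %>%
  # months between each assessment and the first one in the category
  mutate(months_from_first = as.numeric(date - min(date)) / 30.44) %>%
  ungroup() %>%
  # exclude the distance of the first assessment from itself
  filter(months_from_first > 0)

ggplot(time_from_first, aes(x = months_from_first)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ sub_domain) +
  labs(x = "Time from first assessment (months)", y = "Count")
```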
This visualisation is aimed at investigating the overall timing of the “assessment machine”. The evidence of a cyclic nature in when most of the ASQ-3 questionnaires are administered is to be expected given the planned timetable, as is some variability around it (questionnaires cannot possibly be administered at exactly the same intervals of time to everybody). What this plot also suggests, as highlighted by the increasingly wide grey uncertainty bands in the score trends in the previous section, is that we have fewer and fewer measurements as we look at older children.
It is possible to “zoom in”, if desired, and look more closely at a specific sub-domain; the following plot, for example, refers to comm alone, and I have highlighted (with red, dashed vertical lines) the expected times of measurement.
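The overlay can be obtained by filtering the sketch above to a single sub-domain and adding vertical reference lines (the values in expected_months below are placeholders, to be replaced with the actual planned schedule):

```r
library(dplyr)
library(ggplot2)

# Placeholder values for the expected measurement times (months
# from first assessment); substitute the real timetable here
expected_months <- c(16, 25, 34)

time_from_first %>%
  filter(sub_domain == "comm") %>%
  ggplot(aes(x = months_from_first)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = expected_months,
             colour = "red", linetype = "dashed") +
  labs(x = "Time from first assessment (months)", y = "Count")
```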
This will be part of the analysis to be completed between February and March.