Introduction

We are again analyzing data from Kinedu, an app for parents of young children to do activities with their children.

The data we look at here are a set of 15,633,938 binary answers, each given the first time a parent saw one of a set of 297 “indicators”: questions that the app asked to help with developmental categorization.

Approach and goals

Our goal in this report is to understand the clustering of developmental milestones. In particular, there are several kinds of analyses that we’d like to look at.

  • Clustering of milestones. Find milestones that follow the same general trajectory across ages, e.g. things that are generally happening at the same time. Some of these will be within milestone categories, others (perhaps the more interesting ones) between categories.

  • Clustering of individuals. We don’t actually know enough about individuals to cluster them (for the most part), but we can look at the predictive relationships between particular answers (controlling for age), e.g. people who answer one question affirmatively are also more likely to answer another one affirmatively. This approach is likely quite confounded by response biases, so we’d need to model those as well.

  • Dimensionality reduction. Identify the principal components of variance in responding across questions. A lot of this will likely be age-related, but perhaps there are other secondary components that are interesting.

Remarks on data

One thing to look at here is the data we have.

Missing data is clearly a huge problem, as is the small sample size in many cells. Both can be seen by looking at the same matrix, this time excluding cells with fewer than 20 entries.
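As a rough illustration of the cell filtering described above, here is a minimal Python sketch (independent of whatever code produced this report). It assumes each response is a hypothetical `(indicator_id, age, answer)` tuple with `answer` coded 0/1; those names and the toy data are invented for the example. Only the threshold of 20 comes from the text.

```python
from collections import defaultdict

MIN_N = 20  # threshold from the text: drop cells with fewer than 20 entries


def cell_means(responses, min_n=MIN_N):
    """Aggregate (indicator_id, age, answer) rows into per-cell means.

    Returns {(indicator_id, age): (mean_answer, n)} for cells with n >= min_n.
    The tuple layout is an assumption made for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of answers, count]
    for indicator_id, age, answer in responses:
        cell = totals[(indicator_id, age)]
        cell[0] += answer
        cell[1] += 1
    return {
        key: (total / n, n)
        for key, (total, n) in totals.items()
        if n >= min_n
    }


# Made-up toy data: indicator 1 has 25 answers at age 12 but only 3 at age 13.
toy = [(1, 12, 1)] * 15 + [(1, 12, 0)] * 10 + [(1, 13, 1)] * 3
print(cell_means(toy))
```

The sparse cell at age 13 is dropped, which is exactly what turns a small-sample problem into a missing-data problem.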

So at a certain point we will need to do some significant interpolation of missing values, probably using a model. Otherwise, our missing data problem (as well as the sparse cells at the younger and older ages) will cause a lot of spurious findings.

Interpolation of means across ages

We follow the approach of fitting curves independently and using them to interpolate missing data. Let’s plot some empirical curves first. Red is weighted loess; blue is weighted polynomial logistic. I experimented with degrees up to 4 but found that 3 worked acceptably.
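To make the idea of a weighted local fit concrete, here is a small Python sketch of a loess-style smoother: a local linear fit with tricube distance weights multiplied by cell counts. It is not the loess implementation used for the plots; the function names, the default `span`, and the choice of a linear (rather than higher-degree) local fit are assumptions for illustration.

```python
def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, else 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def smooth_at(x0, xs, ys, ns, span=0.75):
    """Predict the proportion at age x0 from a weighted local linear fit.

    xs, ys, ns: observed ages, cell means, and cell counts (case weights).
    span: fraction of observed points used in each local neighbourhood.
    Returns None when no point receives a positive weight.
    """
    if not xs:
        return None
    k = min(len(xs), max(2, round(span * len(xs))))
    # Bandwidth: distance to the k-th nearest observed age (1.0 if that is 0).
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
    sw = sx = sy = sxx = sxy = 0.0
    for x, y, n in zip(xs, ys, ns):
        d = x - x0
        w = n * tricube(d / h)
        sw += w
        sx += w * d
        sy += w * y
        sxx += w * d * d
        sxy += w * d * y
    if sw == 0.0:
        return None
    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:
        fit = sy / sw  # degenerate neighbourhood: fall back to weighted mean
    else:
        slope = (sw * sxy - sx * sy) / denom
        fit = (sy - slope * sx) / sw  # intercept, i.e. the fitted value at x0
    return min(1.0, max(0.0, fit))  # keep the estimate a valid proportion


# Made-up toy curve with age 5 missing; interpolate it from its neighbours.
ages = [a for a in range(11) if a != 5]
means = [0.1 * a for a in ages]
counts = [30] * len(ages)
print(round(smooth_at(5, ages, means, counts), 3))
```

Weighting by cell count means that sparse cells pull the curve less than well-populated ones.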

But honestly, loess looks better overall. So let’s get interpolated curve fits from loess, starting with one indicator.

Now do this for every indicator.

Clustering of age trajectories

We look at these clusters first for raw data and then for interpolated.

Raw data

Let’s follow the approach of computing correlations between average trajectories.
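As a sketch of what correlating average trajectories involves with raw (gappy) data, the Python function below computes a Pearson correlation using only the ages at which both indicators have a value. Representing missing cells as `None`, the minimum of three shared ages, and the toy numbers are all assumptions for the example.

```python
from math import sqrt


def pearson(a, b, min_pairs=3):
    """Pearson correlation over ages where both trajectories are observed.

    a, b: lists of mean responses indexed by age, with None for missing cells.
    Returns None if too few shared ages remain or either series is constant.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < min_pairs:
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


# Made-up trajectories; the first is missing its third age.
traj_a = [0.1, 0.2, None, 0.4]
traj_b = [0.2, 0.4, 0.9, 0.8]
print(pearson(traj_a, traj_b))
```

With many missing cells, different pairs of indicators end up compared over different sets of ages, which is one reason the raw correlations are noisier than the interpolated ones.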

Look at the distribution of correlations, both within and across categories.

This is interesting, but clearly too much information. We need to summarize it for greater ease of reading.

From this plot it’s clear that within-category correlations are substantially higher than between-category correlations.
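One simple way to express that comparison numerically is to average the pairwise correlations separately for same-category and cross-category pairs. The Python sketch below assumes each pair is a hypothetical `(cor, same_cat)` tuple; the values are invented and are not results from this dataset.

```python
from statistics import mean


def summarise(pairs):
    """Mean correlation for within-category and between-category pairs.

    pairs: iterable of (cor, same_cat) tuples, where same_cat is a bool.
    A group with no pairs is reported as None.
    """
    pairs = list(pairs)
    within = [cor for cor, same_cat in pairs if same_cat]
    between = [cor for cor, same_cat in pairs if not same_cat]
    return {
        "within": mean(within) if within else None,
        "between": mean(between) if between else None,
    }


# Invented example values, for illustration only.
toy_pairs = [(0.9, True), (0.8, True), (0.5, False), (0.3, False)]
print(summarise(toy_pairs))
```

A median or a full distribution per group would be more robust than the mean when a few pairs are extreme.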

Curve interpolated data

The interpolated data is cleaner but the ordering is (comfortingly) not that different.

Individual cross-category matches

What are some of the best-matching curves across categories?

There are clearly some trajectories that are almost perfectly correlated across categories. Let’s get some of these pairs.

##    base_id target_id       cor
## 1      232       181 0.9998498
## 2      181       232 0.9998498
## 3      232         4 0.9998361
## 4        4       232 0.9998361
## 5      200       113 0.9998222
## 6      113       200 0.9998222
## 7      219       119 0.9997564
## 8      119       219 0.9997564
## 9      108        47 0.9997480
## 10      47       108 0.9997480
##                                                                                          base_desc
## 1                                                          Stays balanced when standing or walking
## 2  Coordinates movements while walking by alternating his/her feet to get from one side to another
## 3                                                          Stays balanced when standing or walking
## 4                                                                                     While walkin
## 5                                                               Begins to have a coordinated crawl
## 6                                              Holds on to furniture to stand up and stay standing
## 7                      Points and tries to say something in order to obtain an object he/she wants
## 8                                                                   Stands and sits by him/herself
## 9                               Repeats actions that get peoples attention or are funny for others
## 10                                                        Turns hands to open and close containers
##                base_cat
## 1              standing
## 2                  walk
## 3              standing
## 4                  walk
## 5                 crawl
## 6              standing
## 7             point out
## 8              standing
## 9  social and emotional
## 10           manipulate
##                                                                                        target_desc
## 1  Coordinates movements while walking by alternating his/her feet to get from one side to another
## 2                                                          Stays balanced when standing or walking
## 3                                                                                     While walkin
## 4                                                          Stays balanced when standing or walking
## 5                                              Holds on to furniture to stand up and stay standing
## 6                                                               Begins to have a coordinated crawl
## 7                                                                   Stands and sits by him/herself
## 8                      Points and tries to say something in order to obtain an object he/she wants
## 9                                                         Turns hands to open and close containers
## 10                              Repeats actions that get peoples attention or are funny for others
##              target_cat same_cat
## 1                  walk    FALSE
## 2              standing    FALSE
## 3                  walk    FALSE
## 4              standing    FALSE
## 5              standing    FALSE
## 6                 crawl    FALSE
## 7              standing    FALSE
## 8             point out    FALSE
## 9            manipulate    FALSE
## 10 social and emotional    FALSE

We then plot these pairs.

So it’s clear that many of these trajectories are highly correlated simply because they all increase with age…
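A toy check in Python illustrates the point. Two synthetic logistic curves that rise at different ages still correlate strongly in their levels, while their age-to-age changes do not. The curves, midpoints, and the differencing step are assumptions made for this sketch; differencing is not something the analysis above does.

```python
from math import exp, sqrt


def pearson(a, b):
    """Plain Pearson correlation of two equal-length, non-constant series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / sqrt(saa * sbb)


def logistic_curve(ages, midpoint, slope=1.0):
    """Synthetic trajectory: proportion answering yes rises around midpoint."""
    return [1.0 / (1.0 + exp(-slope * (age - midpoint))) for age in ages]


def diffs(ys):
    """Age-to-age changes in a trajectory."""
    return [later - earlier for earlier, later in zip(ys, ys[1:])]


ages = list(range(25))
early = logistic_curve(ages, midpoint=8)   # hypothetical early milestone
late = logistic_curve(ages, midpoint=14)   # hypothetical later milestone

r_levels = pearson(early, late)
r_changes = pearson(diffs(early), diffs(late))
print(round(r_levels, 3), round(r_changes, 3))
```

The gap between the two numbers suggests that a high correlation between levels mostly reflects a shared upward trend rather than shared timing.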

Hierarchical clustering

A next step is to try to cluster by correlations. Note that here we’re still working with the interpolated data.
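For reference, clustering by correlations can be sketched as agglomerative clustering on the distance 1 - correlation. The Python below uses average linkage and stops at a fixed number of clusters; the linkage method, the distance definition, the function name, and the toy matrix are assumptions, since the text does not specify them.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average linkage.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Repeatedly merges the two clusters with the smallest mean pairwise
    distance until n_clusters remain. Returns lists of item indices.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Invented correlation matrix for four indicators: 0 and 1 move together,
# as do 2 and 3, with weak correlation across the two groups.
cor = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
dist = [[1.0 - c for c in row] for row in cor]
print(average_linkage(dist, n_clusters=2))
```

Because nearly all trajectories rise with age, distances based on raw correlations will be small across the board, so the resulting clusters should be read with that caveat in mind.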