1 Setup

Load the data, and visualise quickly:

rm( list = ls() )
require( ggplot2 )
require( gridExtra )
require( reshape2 )
require( RColorBrewer )
d <- read.csv("./StudentsDataSets/raw_DTC.csv", stringsAsFactors = FALSE)
(d)

2 Problem

The problem formulation was :

In this data, the response (dependent) variables on the task were paired as:

cogDTC_num vs gaitDTC_num, cogDTC_lit vs gaitDTC_lit

And:

With the aim to establish participants that have lower cogDTC and higher gaitDTC for each of numeracy and literacy (suggesting they are abnormally prioritising their cognitive performance over their gait).

To paraphrase, we are seeking to find meaningful cutoffs that allow us to label participants:

“Posture First” - those participants that prioritise maintaining posture in the gait task over cognitive performance
“Posture Second” - those participants that prioritise maintaining cognitive performance over gait

With “Posture Second” being the abnormal group.

3 Quick Visualisations

Visualise the univariate distributions:

d.m <- melt( d )

## No id variables; using all as measure variables

ggplot( d.m, aes( x = value, fill = variable ) ) +
  geom_histogram( position = "identity", alpha = 0.5, bins = 10 ) +
  scale_fill_brewer(palette = "Set1") +
  theme_classic() +
  facet_wrap( ~ variable, scales = "free" )

## Warning: Removed 2 rows containing non-finite values (stat_bin).

Observations:

We see that cogDTC_num is fairly well behaved in the sense of being normally distributed.
The remainder, not so much.
Note that for the variables gaitDTC_num and cogDTC_lit, there is a floor effect.

So deciding on cutoffs on an assumption of normality (for example, by taking those values which are +/- 1 or 2 standard deviations) is not going to be robust – in the sense it’s not data driven – however, if we had strong theoretical grounds, we might be able to make a case.

4 Reworking the Visualisation

Let’s scatterplot the joint variables, with quantiles (median, IQR etc.) to see if this offers some insight into how we might decide on relevant cutoffs

# a table of quantiles for each variable
quant.tab <- data.frame( apply( d, 2, quantile, na.rm = TRUE ) )

p.num <- ggplot(d) +
  annotate("rect", xmin=quant.tab$gaitDTC_num[2],  xmax=quant.tab$gaitDTC_num[4], 
                   ymin=quant.tab$cogDTC_num[2],   ymax=quant.tab$cogDTC_num[4], fill="blue", alpha=0.2 ) +
  geom_hline( yintercept = quant.tab$cogDTC_num[3], colour = "black" ) +
  geom_vline( xintercept = quant.tab$gaitDTC_num[3], colour = "black" ) +
  geom_point( aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", size = 3 ) + 
  theme_classic()

p.lit <- ggplot(d) +
  annotate("rect", xmin=quant.tab$gaitDTC_lit[2],  xmax=quant.tab$gaitDTC_lit[4], 
                   ymin=quant.tab$cogDTC_lit[2], ymax=quant.tab$cogDTC_lit[4], fill="purple", alpha=0.2 ) +
  geom_hline( yintercept = quant.tab$cogDTC_lit[3], colour = "black" ) +
  geom_vline( xintercept = quant.tab$gaitDTC_lit[3], colour = "black" ) +
  geom_point( aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", size = 3 ) + 
  theme_classic()

grid.arrange( p.num, p.lit, ncol = 2)

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

The left plot is the numeric task, and the right panel the literacy task. For both, the black lines show the median of each variable, and the red boxes the inter-quartile ranges.

5 Next steps : Cutoffs/Thresholds

So we need to decide on thresholds – and since we have taken a non-parametric, data-driven approach thus far, we might propose the following rules:

Posture First : participants with gaitGTC <= median of gaitGTC and cogDTC >= median of cogDTC
Posture Second : participants with gaitGTC >= median of gaitGTC and cogDTC <= median of cogDTC

Which results in:

p.num.t <- ggplot(d) +
  annotate("rect", xmin=quant.tab$gaitDTC_num[3],  xmax=Inf, 
                   ymin=-Inf,   ymax=quant.tab$cogDTC_num[3], fill="blue", alpha=0.2 ) +
  geom_hline( yintercept = quant.tab$cogDTC_num[3], colour = "black" ) +
  geom_vline( xintercept = quant.tab$gaitDTC_num[3], colour = "black" ) +
  geom_point( aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", size = 3 ) + 
  theme_classic()

p.lit.t <- ggplot(d) +
  annotate("rect", xmin=quant.tab$gaitDTC_lit[3],  xmax=Inf, 
                   ymin=-Inf, ymax=quant.tab$cogDTC_lit[3], fill="purple", alpha=0.2 ) +
  geom_hline( yintercept = quant.tab$cogDTC_lit[3], colour = "black" ) +
  geom_vline( xintercept = quant.tab$gaitDTC_lit[3], colour = "black" ) +
  geom_point( aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", size = 3 ) + 
  theme_classic()

grid.arrange( p.num.t, p.lit.t, ncol = 2)

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

Finally, label up the participants with “Posture Second” behaviours

d$PostureSecond.Num <- ifelse( d$gaitDTC_num > quant.tab$gaitDTC_num[3] & d$cogDTC_num < quant.tab$cogDTC_num[3], 1, 0 )
d$PostureSecond.Lit <- ifelse( d$gaitDTC_lit > quant.tab$gaitDTC_lit[3] & d$cogDTC_lit < quant.tab$cogDTC_lit[3], 1, 0 )

And plot just to check

p.num.t <- ggplot(d) +
  annotate("rect", xmin=quant.tab$gaitDTC_num[3],  xmax=Inf, 
                   ymin=-Inf,   ymax=quant.tab$cogDTC_num[3], fill="blue", alpha=0.2 ) +
  geom_hline( yintercept = quant.tab$cogDTC_num[3], colour = "black" ) +
  geom_vline( xintercept = quant.tab$gaitDTC_num[3], colour = "black" ) +
  geom_point( aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", size = 3 ) + 
  geom_point( data = d[ which( d$PostureSecond.Num == 1), ], aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", shape = 1, size = 8 ) + 
  theme_classic()

p.lit.t <- ggplot(d) +
  annotate("rect", xmin=quant.tab$gaitDTC_lit[3],  xmax=Inf, 
                   ymin=-Inf, ymax=quant.tab$cogDTC_lit[3], fill="purple", alpha=0.2 ) +
  geom_hline( yintercept = quant.tab$cogDTC_lit[3], colour = "black" ) +
  geom_vline( xintercept = quant.tab$gaitDTC_lit[3], colour = "black" ) +
  geom_point( aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", size = 3 ) + 
  geom_point( data = d[ which( d$PostureSecond.Lit == 1), ], aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", shape = 1, size = 8 ) + 
  theme_classic()

grid.arrange( p.num.t, p.lit.t, ncol = 2)

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

For reference, the table of data:

(d)

And we have :

Posture Second in the numerical task, a total of 5 participants
Posture Second in the literacy task, a total of 6 participants

Distributions and Cut-offs

1 Setup

2 Problem

3 Quick Visualisations

4 Reworking the Visualisation

5 Next steps : Cutoffs/Thresholds