Load the data, and visualise quickly:
rm( list = ls() )
require( ggplot2 )
require( gridExtra )
require( reshape2 )
require( RColorBrewer )
d <- read.csv("./StudentsDataSets/raw_DTC.csv", stringsAsFactors = FALSE)
(d)
The problem formulation was :
In this data, the response (dependent) variables on the task were paired as:
cogDTC_num vs gaitDTC_num, cogDTC_lit vs gaitDTC_lit
And:
With the aim to establish participants that have lower cogDTC and higher gaitDTC for each of numeracy and literacy (suggesting they are abnormally prioritising their cognitive performance over their gait).
To paraphrase, we are seeking to find meaningful cutoffs that allow us to label participants:
With “Posture Second” being the abnormal group.
Visualise the univariate distributions:
d.m <- melt( d )
## No id variables; using all as measure variables
ggplot( d.m, aes( x = value, fill = variable ) ) +
geom_histogram( position = "identity", alpha = 0.5, bins = 10 ) +
scale_fill_brewer(palette = "Set1") +
theme_classic() +
facet_wrap( ~ variable, scales = "free" )
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Observations:
cogDTC_num
is fairly well behaved in the sense of being normally distributed.gaitDTC_num
and cogDTC_lit
, there is a floor effect.So deciding on cutoffs on an assumption of normality (for example, by taking those values which are +/- 1 or 2 standard deviations) is not going to be robust – in the sense it’s not data driven – however, if we had strong theoretical grounds, we might be able to make a case.
Let’s scatterplot the joint variables, with quantiles (median, IQR etc.) to see if this offers some insight into how we might decide on relevant cutoffs
# a table of quantiles for each variable
quant.tab <- data.frame( apply( d, 2, quantile, na.rm = TRUE ) )
p.num <- ggplot(d) +
annotate("rect", xmin=quant.tab$gaitDTC_num[2], xmax=quant.tab$gaitDTC_num[4],
ymin=quant.tab$cogDTC_num[2], ymax=quant.tab$cogDTC_num[4], fill="blue", alpha=0.2 ) +
geom_hline( yintercept = quant.tab$cogDTC_num[3], colour = "black" ) +
geom_vline( xintercept = quant.tab$gaitDTC_num[3], colour = "black" ) +
geom_point( aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", size = 3 ) +
theme_classic()
p.lit <- ggplot(d) +
annotate("rect", xmin=quant.tab$gaitDTC_lit[2], xmax=quant.tab$gaitDTC_lit[4],
ymin=quant.tab$cogDTC_lit[2], ymax=quant.tab$cogDTC_lit[4], fill="purple", alpha=0.2 ) +
geom_hline( yintercept = quant.tab$cogDTC_lit[3], colour = "black" ) +
geom_vline( xintercept = quant.tab$gaitDTC_lit[3], colour = "black" ) +
geom_point( aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", size = 3 ) +
theme_classic()
grid.arrange( p.num, p.lit, ncol = 2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
The left plot is the numeric task, and the right panel the literacy task. For both, the black lines show the median of each variable, and the red boxes the inter-quartile ranges.
So we need to decide on thresholds – and since we have taken a non-parametric, data-driven approach thus far, we might propose the following rules:
gaitGTC
<= median of gaitGTC
and cogDTC
>= median of cogDTC
gaitGTC
>= median of gaitGTC
and cogDTC
<= median of cogDTC
Which results in:
p.num.t <- ggplot(d) +
annotate("rect", xmin=quant.tab$gaitDTC_num[3], xmax=Inf,
ymin=-Inf, ymax=quant.tab$cogDTC_num[3], fill="blue", alpha=0.2 ) +
geom_hline( yintercept = quant.tab$cogDTC_num[3], colour = "black" ) +
geom_vline( xintercept = quant.tab$gaitDTC_num[3], colour = "black" ) +
geom_point( aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", size = 3 ) +
theme_classic()
p.lit.t <- ggplot(d) +
annotate("rect", xmin=quant.tab$gaitDTC_lit[3], xmax=Inf,
ymin=-Inf, ymax=quant.tab$cogDTC_lit[3], fill="purple", alpha=0.2 ) +
geom_hline( yintercept = quant.tab$cogDTC_lit[3], colour = "black" ) +
geom_vline( xintercept = quant.tab$gaitDTC_lit[3], colour = "black" ) +
geom_point( aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", size = 3 ) +
theme_classic()
grid.arrange( p.num.t, p.lit.t, ncol = 2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
Finally, label up the participants with “Posture Second” behaviours
d$PostureSecond.Num <- ifelse( d$gaitDTC_num > quant.tab$gaitDTC_num[3] & d$cogDTC_num < quant.tab$cogDTC_num[3], 1, 0 )
d$PostureSecond.Lit <- ifelse( d$gaitDTC_lit > quant.tab$gaitDTC_lit[3] & d$cogDTC_lit < quant.tab$cogDTC_lit[3], 1, 0 )
And plot just to check
p.num.t <- ggplot(d) +
annotate("rect", xmin=quant.tab$gaitDTC_num[3], xmax=Inf,
ymin=-Inf, ymax=quant.tab$cogDTC_num[3], fill="blue", alpha=0.2 ) +
geom_hline( yintercept = quant.tab$cogDTC_num[3], colour = "black" ) +
geom_vline( xintercept = quant.tab$gaitDTC_num[3], colour = "black" ) +
geom_point( aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", size = 3 ) +
geom_point( data = d[ which( d$PostureSecond.Num == 1), ], aes( x = gaitDTC_num, y = cogDTC_num ), colour = "blue", shape = 1, size = 8 ) +
theme_classic()
p.lit.t <- ggplot(d) +
annotate("rect", xmin=quant.tab$gaitDTC_lit[3], xmax=Inf,
ymin=-Inf, ymax=quant.tab$cogDTC_lit[3], fill="purple", alpha=0.2 ) +
geom_hline( yintercept = quant.tab$cogDTC_lit[3], colour = "black" ) +
geom_vline( xintercept = quant.tab$gaitDTC_lit[3], colour = "black" ) +
geom_point( aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", size = 3 ) +
geom_point( data = d[ which( d$PostureSecond.Lit == 1), ], aes( x = gaitDTC_lit, y = cogDTC_lit ), colour = "purple", shape = 1, size = 8 ) +
theme_classic()
grid.arrange( p.num.t, p.lit.t, ncol = 2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
For reference, the table of data:
(d)
And we have :