library('dplyr')
library('reshape2')
library('DescTools')
library('xtable')

Spearman’s School data

Spearman’s study of school childrens performance in sensory and academic tests was one basis for IQ theory and often cited as the original correlation analysis and a precursor to factor analysis.

I’ve seen Spearman’s final tables, e.g. for the School IV series, but not the raw data as Spearman shared it in his original publication in 1904.

I’d like more educational data in R, and this seems like a worthwhile dataset, for illustrating some statistics, like correlation, but also issues with reproducibility.

An .html version of Spearman’s study exists online, courtesy of Christopher D. Green of York University, Toronto. The tables of School data are in the Appendix, image files copied from a .pdf:

‘General Intelligence,’ Objectively Determined and Measured (1904), Charles Spearman, American Journal of Psychology 15, 201-293.

A large pdf copy is online courtesy of Niels Waller of the University of Minnesota: Readings: Spearman1904.pdf

There are some issues with the data, examined in this paper:

Spearman’s original computation of g : A model for Burt?, Fancher, Raymond E., British Journal of Psychology, 1985, Vol.76(3), pp.341-352

Unfortunately that’s only available in the academic sphere, to subscribing institutions, but I’ve extracted some of Fancher’s tables below.

What’s the best way to reconstruct Spearman’s remaining data series?

And is it possible to identify any errors and recalculate the correlation tables, to confirm either Spearman or Fancher’s results?

School.I - ‘Village School, 24 Oldest Children’

Spearman’s table of data as published:

Spearman School.I data

Sex <- c('f','m','f','f','m','f','f','f','m','m','f','f'
  ,'f','f','f','m','m','f','m','m','m','m','f','m')

Years <- c(11,12,12,13,11,11,11,13,12,12,12,13,13,12,10,11,10,11,13,12,10,11,11,11)
Months <- c(6,11,8,8,4,11,3,1,5,7,8,10,1,1,6,5,0,9,7,6,4,7,2,2)

Pitch <- c(8,15,14,13,5,25,10,10,18,14,60,20,40,45,33,25,90,17,24,18,70,17,28,90)
Light <- c(4,3,6,4,14,7,19,12,11,30,3,12,5,12,5,4,15,15,26,35,10,42,20,25)
Weight <- c(4,4,4,9,7,4,8,10,9,7,10,10,12,9,15,28,5,20,13,14,14,16,17,18)

Sense.A <- c(6,11,16,1,3,10,8,2,5,21,12,13,4,9,15,17,22,14,19,18,23,24,7,20)
Sense.B <- c(5,7,10,1,2,14,19,4,6,22,9,12,8,13,18,11,21,20,17,3,24,23,15,16)
Clever <- c(2,22,7,1,3,9,12,6,11,19,4,18,8,14,10,17,5,15,24,6,20,23,13,21)
School.I <- data.frame(Sex, Years, Months
                       , Pitch, Light, Weight
                       , Sense.A, Sense.B, Clever
                       , stringsAsFactors = FALSE)

str(School.I) 
## 'data.frame':    24 obs. of  9 variables:
##  $ Sex    : chr  "f" "m" "f" "f" ...
##  $ Years  : num  11 12 12 13 11 11 11 13 12 12 ...
##  $ Months : num  6 11 8 8 4 11 3 1 5 7 ...
##  $ Pitch  : num  8 15 14 13 5 25 10 10 18 14 ...
##  $ Light  : num  4 3 6 4 14 7 19 12 11 30 ...
##  $ Weight : num  4 4 4 9 7 4 8 10 9 7 ...
##  $ Sense.A: num  6 11 16 1 3 10 8 2 5 21 ...
##  $ Sense.B: num  5 7 10 1 2 14 19 4 6 22 ...
##  $ Clever : num  2 22 7 1 3 9 12 6 11 19 ...

Checking

Fancher indicates an issue with the variable Clever, a ranking of Cleverness.

Sorting or tablulating the ranks illustrates the issue:

dfr <- School.I
sort(dfr$Clever) # 2 are ranked 6, 16 is missing
 [1]  1  2  3  4  5  6  6  7  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23
[24] 24
length(dfr$Clever) == length(unique(dfr$Clever)) # does length equal length of unique values?
[1] FALSE
addmargins(xtabs(~Sex+Clever, data = dfr), margin = 1) # a m & f both ranked 6
     Clever
Sex   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22 23 24
  f   1 1 0 1 0 1 1 1 1  1  0  1  1  1  1  0  1  0  0  0  0  0  0
  m   0 0 1 0 1 1 0 0 0  0  1  0  0  0  0  1  0  1  1  1  1  1  1
  Sum 1 1 1 1 1 2 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
Desc(as.factor(Clever), data = dfr, plotit = TRUE) # treating rank as factor, not a number
------------------------------------------------------------------------- 
as.factor(Clever) (factor)

  length      n    NAs levels unique  dupes
      24     24      0     23     23      y

   level freq  perc cumfreq cumperc
1      6    2  .083       2    .083
2      1    1  .042       3    .125
3      2    1  .042       4    .167
4      3    1  .042       5    .208
5      4    1  .042       6    .250
6      5    1  .042       7    .292
7      7    1  .042       8    .333
8      8    1  .042       9    .375
9      9    1  .042      10    .417
10    10    1  .042      11    .458
11    11    1  .042      12    .500
12    12    1  .042      13    .542
... etc.
 [list output truncated]

A closer look:

which(dfr$Clever == 6)  # which rows/observations equal 6 for 'Clever'
## [1]  8 20
dfr[Clever == 6, ]      # return those rows where 'Clever' equals 6
##    Sex Years Months Pitch Light Weight Sense.A Sense.B Clever
## 8    f    13      1    10    12     10       2       4      6
## 20   m    12      6    18    35     14      18       3      6

Fancher’s correction

dfr[20,'Clever'] <- 16  # take 'Clever' to be consistent with other ranks & measures

length(dfr$Clever) == length(unique(dfr$Clever)) # ranks are unique
## [1] TRUE

Perhaps the raw data isn’t available because it’s problematic, due to some errors in calculation or reproduction for publication. An attempt at research that isn’t quite reproducible from the published data.

Re-calculating Spearman’s results

Spearman’s results:

Spearman School.I results

Re-calculating Spearman’s results, using -age and cor(method = 'spearman'):

dfr_cor <- dfr %>%
  select(which(sapply(dfr, is.numeric) == TRUE)) %>%
  mutate(Age = (Years + Months/12) # that's how I measure age
         , Age = -Age # no need to rank, cor method = 'spearman' does that, duh!
         ) %>%
  select(Age, seq(3,ncol(dfr)-1)) %>%
  cor(method = 'spearman')

dfr_cor %>%
  round(2)
         Age Pitch Light Weight Sense.A Sense.B Clever
Age     1.00  0.32  0.20   0.21    0.33    0.49   0.01
Pitch   0.32  1.00 -0.02   0.41    0.44    0.42   0.25
Light   0.20 -0.02  1.00   0.30    0.42    0.45   0.47
Weight  0.21  0.41  0.30   1.00    0.31    0.29   0.45
Sense.A 0.33  0.44  0.42   0.31    1.00    0.73   0.65
Sense.B 0.49  0.42  0.45   0.29    0.73    1.00   0.54
Clever  0.01  0.25  0.47   0.45    0.65    0.54   1.00

Issues

The results don’t reproduce quite the same values as Spearman, as may be expected, given the correction.

However, nor do they reproduce Fancher’s results, perhaps due to the way I’ve treated age or other variables.

Fancher’s results Table 2:

Fancher School.I results

How should Age be calculated and evaluated in correlation with other variables?

Presenting it in similar fashion is tricky, e.g. this attempt doesn’t deal with duplication of pairs.

print(
  dfr_cor %>%
    melt(measure.vars = 'value') %>%
    filter(!Var1 == Var2)
  , format = 'html', digits = 2
  , caption = 'Recalculating Spearman')
      Var1    Var2  value
1    Pitch     Age  0.317
2    Light     Age  0.198
3   Weight     Age  0.214
4  Sense.A     Age  0.326
5  Sense.B     Age  0.489
6   Clever     Age  0.010
7      Age   Pitch  0.317
8    Light   Pitch -0.016
9   Weight   Pitch  0.410
10 Sense.A   Pitch  0.438
11 Sense.B   Pitch  0.420
12  Clever   Pitch  0.246
13     Age   Light  0.198
14   Pitch   Light -0.016
15  Weight   Light  0.297
16 Sense.A   Light  0.424
17 Sense.B   Light  0.449
18  Clever   Light  0.469
19     Age  Weight  0.214
20   Pitch  Weight  0.410
21   Light  Weight  0.297
22 Sense.A  Weight  0.307
23 Sense.B  Weight  0.293
24  Clever  Weight  0.445
25     Age Sense.A  0.326
26   Pitch Sense.A  0.438
27   Light Sense.A  0.424
28  Weight Sense.A  0.307
29 Sense.B Sense.A  0.725
30  Clever Sense.A  0.647
31     Age Sense.B  0.489
32   Pitch Sense.B  0.420
33   Light Sense.B  0.449
34  Weight Sense.B  0.293
35 Sense.A Sense.B  0.725
36  Clever Sense.B  0.538
37     Age  Clever  0.010
38   Pitch  Clever  0.246
39   Light  Clever  0.469
40  Weight  Clever  0.445
41 Sense.A  Clever  0.647
42 Sense.B  Clever  0.538

DescTools offers a couple of graphical ways to represent the same matrix of correlations:

PlotCorr(dfr_cor)

PlotWeb(dfr_cor, col=c(hred, hblue))

School.II - ‘Village School, 36 next Oldest Children’

Adding a couple more of Spearman’s School data series:

Sex <- c('m','m','f','m','f','m','m','m','m','f','f','f'
         ,'f','f','m','m','m','f','f','m','m','m','m','f'
         ,'m','m','m','m','m','f','f','m','m','m','m')

Age.Years <- c(9,8,7,8,8,8,9,8,9,7,8,7
               ,9,5,5,10,7,7,6,9,7,8,6,6
               ,7,5,5,6,6,7,6,6,7,5,6)

Age.Months <- c(7,7,3,0,0,1,9,2,0,11,2,2
               ,11,8,7,2,0,1,9,6,10,2,7,10
               ,11,6,11,6,5,0,5,1,3,6,6)

Pitch <- c('6','6','16','24','26','35','35','35','38','42','48','48'
           ,'67','67','70','74','74','77','77','80','80','96','96','104'
           ,'112','112','120','120','>120','>120','>120','>120','>120','>120','>120')

Intellectual.Class <- factor(c(2,2,1,2,2,2,3,1,3,2,2,1
                               ,3,NA,NA,2,NA,1,NA,3,1,2,NA,NA
                               ,2,NA,NA,NA,NA,1,NA,NA,1,NA,NA)
                             , ordered = TRUE)

str(Intellectual.Class)   # reverse this order?
##  Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 2 2 2 3 1 3 2 ...
School.II <- data.frame(Sex, Age.Years, Age.Months
                        , Pitch, Intellectual.Class
                        , stringsAsFactors = FALSE)

str(School.II)
## 'data.frame':    35 obs. of  5 variables:
##  $ Sex               : chr  "m" "m" "f" "m" ...
##  $ Age.Years         : num  9 8 7 8 8 8 9 8 9 7 ...
##  $ Age.Months        : num  7 7 3 0 0 1 9 2 0 11 ...
##  $ Pitch             : chr  "6" "6" "16" "24" ...
##  $ Intellectual.Class: Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 2 2 2 3 1 3 2 ...

School.III - ‘High Class Preparatory School for Boys’

Sex <- c('f','m','f','f','m','f','f','f'
         ,'m','m','f','f','f','f','f','m'
         ,'m','f','m','m','m','m','f','m')

Age.Years <- c(10,12,11,10,13,12,10,9
               ,12,10,11,10,9,11,11,10
               ,10,10,10,10,12,10,13)

Age.Months <- c(9,4,1,11,7,6,4,5
                ,0,2,2,1,8,10,1,6
                ,8,4,4,1,3,7,3)

Discriminative.Pitch <- c('50','3','10','>60','4','2','4','20
                          ','11','11','24','5','3','5','6','7
                          ','15','11','14','15','7','4','>60')

Discriminative.Light <- c(10,10,10,10,12,10,10,10
                          ,10,12,14,18,18,13,13,14
                          ,19,14,13,13,19,16,19)

Discriminative.Weight <- c(4,6,9,9,5,10,11,11
                          ,12,11,10,7,9,13,13,11
                          ,10,13,18,28,13,16,27)

Rank.Classics <- c(16,5,13,22,1,4,12,23 
                  ,8,3,7,20,10,2,11,17
                  ,21,19,18,15,9,14,6)

Rank.French <- c(19,6,11,23,1,2,14,22 
                  ,8,5,7,15,13,3,12,18
                  ,20,21,16,10,9,17,4)

Rank.English <- c(10,6,11,22,1,2,13,23 
                  ,15,4,7,18,14,3,12,17
                  ,21,9,8,20,16,19,5)

Rank.Mathem. <- c(7,5,13,22,2,1,18,23 
                  ,15,4,6,16,12,3,9,13
                  ,19,21,17,10,11,20,8)
School.III <- data.frame(Age.Years, Age.Months
                         , Discriminative.Pitch
                         , Discriminative.Light
                         , Discriminative.Weight
                         , Rank.Classics
                         , Rank.French
                         , Rank.English
                         , Rank.Mathem.
                         , stringsAsFactors = FALSE)

str(School.III)
## 'data.frame':    23 obs. of  9 variables:
##  $ Age.Years            : num  10 12 11 10 13 12 10 9 12 10 ...
##  $ Age.Months           : num  9 4 1 11 7 6 4 5 0 2 ...
##  $ Discriminative.Pitch : chr  "50" "3" "10" ">60" ...
##  $ Discriminative.Light : num  10 10 10 10 12 10 10 10 10 12 ...
##  $ Discriminative.Weight: num  4 6 9 9 5 10 11 11 12 11 ...
##  $ Rank.Classics        : num  16 5 13 22 1 4 12 23 8 3 ...
##  $ Rank.French          : num  19 6 11 23 1 2 14 22 8 5 ...
##  $ Rank.English         : num  10 6 11 22 1 2 13 23 15 4 ...
##  $ Rank.Mathem.         : num  7 5 13 22 2 1 18 23 15 4 ...

School.IV - ‘High Class Preparatory School for Boys’ (Xmas, Easter & July)

This is Spearman’s table of data, it’s quite complex, with nested columns for data series over time:

Spearman School.IV data

Anyone fancy entering the data, or reproducing Fancher’s results?

Fancher’s Table 3:

Fancher table 3

Fancher’s Table 4:

Fancher table 4