library('dplyr')
library('reshape2')
library('DescTools')
library('xtable')
Spearman’s study of school childrens performance in sensory and academic tests was one basis for IQ theory and often cited as the original correlation analysis and a precursor to factor analysis.
I’ve seen Spearman’s final tables, e.g. for the School IV series, but not the raw data as Spearman shared it in his original publication in 1904.
I’d like more educational data in R, and this seems like a worthwhile dataset, for illustrating some statistics, like correlation, but also issues with reproducibility.
An .html version of Spearman’s study exists online, courtesy of Christopher D. Green of York University, Toronto. The tables of School data are in the Appendix, image files copied from a .pdf:
‘General Intelligence,’ Objectively Determined and Measured (1904), Charles Spearman, American Journal of Psychology 15, 201-293.
A large pdf copy is online courtesy of Niels Waller of the University of Minnesota: Readings: Spearman1904.pdf
There are some issues with the data, examined in this paper:
Spearman’s original computation of g : A model for Burt?, Fancher, Raymond E., British Journal of Psychology, 1985, Vol.76(3), pp.341-352
Unfortunately that’s only available in the academic sphere, to subscribing institutions, but I’ve extracted some of Fancher’s tables below.
What’s the best way to reconstruct Spearman’s remaining data series?
And is it possible to identify any errors and recalculate the correlation tables, to confirm either Spearman or Fancher’s results?
Spearman’s table of data as published:
Sex <- c('f','m','f','f','m','f','f','f','m','m','f','f'
,'f','f','f','m','m','f','m','m','m','m','f','m')
Years <- c(11,12,12,13,11,11,11,13,12,12,12,13,13,12,10,11,10,11,13,12,10,11,11,11)
Months <- c(6,11,8,8,4,11,3,1,5,7,8,10,1,1,6,5,0,9,7,6,4,7,2,2)
Pitch <- c(8,15,14,13,5,25,10,10,18,14,60,20,40,45,33,25,90,17,24,18,70,17,28,90)
Light <- c(4,3,6,4,14,7,19,12,11,30,3,12,5,12,5,4,15,15,26,35,10,42,20,25)
Weight <- c(4,4,4,9,7,4,8,10,9,7,10,10,12,9,15,28,5,20,13,14,14,16,17,18)
Sense.A <- c(6,11,16,1,3,10,8,2,5,21,12,13,4,9,15,17,22,14,19,18,23,24,7,20)
Sense.B <- c(5,7,10,1,2,14,19,4,6,22,9,12,8,13,18,11,21,20,17,3,24,23,15,16)
Clever <- c(2,22,7,1,3,9,12,6,11,19,4,18,8,14,10,17,5,15,24,6,20,23,13,21)
School.I <- data.frame(Sex, Years, Months
, Pitch, Light, Weight
, Sense.A, Sense.B, Clever
, stringsAsFactors = FALSE)
str(School.I)
## 'data.frame': 24 obs. of 9 variables:
## $ Sex : chr "f" "m" "f" "f" ...
## $ Years : num 11 12 12 13 11 11 11 13 12 12 ...
## $ Months : num 6 11 8 8 4 11 3 1 5 7 ...
## $ Pitch : num 8 15 14 13 5 25 10 10 18 14 ...
## $ Light : num 4 3 6 4 14 7 19 12 11 30 ...
## $ Weight : num 4 4 4 9 7 4 8 10 9 7 ...
## $ Sense.A: num 6 11 16 1 3 10 8 2 5 21 ...
## $ Sense.B: num 5 7 10 1 2 14 19 4 6 22 ...
## $ Clever : num 2 22 7 1 3 9 12 6 11 19 ...
Fancher indicates an issue with the variable Clever, a ranking of Cleverness.
Sorting or tablulating the ranks illustrates the issue:
dfr <- School.I
sort(dfr$Clever) # 2 are ranked 6, 16 is missing
[1] 1 2 3 4 5 6 6 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22 23
[24] 24
length(dfr$Clever) == length(unique(dfr$Clever)) # does length equal length of unique values?
[1] FALSE
addmargins(xtabs(~Sex+Clever, data = dfr), margin = 1) # a m & f both ranked 6
Clever
Sex 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22 23 24
f 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0
m 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 1
Sum 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Desc(as.factor(Clever), data = dfr, plotit = TRUE) # treating rank as factor, not a number
-------------------------------------------------------------------------
as.factor(Clever) (factor)
length n NAs levels unique dupes
24 24 0 23 23 y
level freq perc cumfreq cumperc
1 6 2 .083 2 .083
2 1 1 .042 3 .125
3 2 1 .042 4 .167
4 3 1 .042 5 .208
5 4 1 .042 6 .250
6 5 1 .042 7 .292
7 7 1 .042 8 .333
8 8 1 .042 9 .375
9 9 1 .042 10 .417
10 10 1 .042 11 .458
11 11 1 .042 12 .500
12 12 1 .042 13 .542
... etc.
[list output truncated]
A closer look:
which(dfr$Clever == 6) # which rows/observations equal 6 for 'Clever'
## [1] 8 20
dfr[Clever == 6, ] # return those rows where 'Clever' equals 6
## Sex Years Months Pitch Light Weight Sense.A Sense.B Clever
## 8 f 13 1 10 12 10 2 4 6
## 20 m 12 6 18 35 14 18 3 6
dfr[20,'Clever'] <- 16 # take 'Clever' to be consistent with other ranks & measures
length(dfr$Clever) == length(unique(dfr$Clever)) # ranks are unique
## [1] TRUE
Perhaps the raw data isn’t available because it’s problematic, due to some errors in calculation or reproduction for publication. An attempt at research that isn’t quite reproducible from the published data.
Spearman’s results:
Re-calculating Spearman’s results, using -age and cor(method = 'spearman'):
dfr_cor <- dfr %>%
select(which(sapply(dfr, is.numeric) == TRUE)) %>%
mutate(Age = (Years + Months/12) # that's how I measure age
, Age = -Age # no need to rank, cor method = 'spearman' does that, duh!
) %>%
select(Age, seq(3,ncol(dfr)-1)) %>%
cor(method = 'spearman')
dfr_cor %>%
round(2)
Age Pitch Light Weight Sense.A Sense.B Clever
Age 1.00 0.32 0.20 0.21 0.33 0.49 0.01
Pitch 0.32 1.00 -0.02 0.41 0.44 0.42 0.25
Light 0.20 -0.02 1.00 0.30 0.42 0.45 0.47
Weight 0.21 0.41 0.30 1.00 0.31 0.29 0.45
Sense.A 0.33 0.44 0.42 0.31 1.00 0.73 0.65
Sense.B 0.49 0.42 0.45 0.29 0.73 1.00 0.54
Clever 0.01 0.25 0.47 0.45 0.65 0.54 1.00
The results don’t reproduce quite the same values as Spearman, as may be expected, given the correction.
However, nor do they reproduce Fancher’s results, perhaps due to the way I’ve treated age or other variables.
Fancher’s results Table 2:
How should Age be calculated and evaluated in correlation with other variables?
Presenting it in similar fashion is tricky, e.g. this attempt doesn’t deal with duplication of pairs.
print(
dfr_cor %>%
melt(measure.vars = 'value') %>%
filter(!Var1 == Var2)
, format = 'html', digits = 2
, caption = 'Recalculating Spearman')
Var1 Var2 value
1 Pitch Age 0.317
2 Light Age 0.198
3 Weight Age 0.214
4 Sense.A Age 0.326
5 Sense.B Age 0.489
6 Clever Age 0.010
7 Age Pitch 0.317
8 Light Pitch -0.016
9 Weight Pitch 0.410
10 Sense.A Pitch 0.438
11 Sense.B Pitch 0.420
12 Clever Pitch 0.246
13 Age Light 0.198
14 Pitch Light -0.016
15 Weight Light 0.297
16 Sense.A Light 0.424
17 Sense.B Light 0.449
18 Clever Light 0.469
19 Age Weight 0.214
20 Pitch Weight 0.410
21 Light Weight 0.297
22 Sense.A Weight 0.307
23 Sense.B Weight 0.293
24 Clever Weight 0.445
25 Age Sense.A 0.326
26 Pitch Sense.A 0.438
27 Light Sense.A 0.424
28 Weight Sense.A 0.307
29 Sense.B Sense.A 0.725
30 Clever Sense.A 0.647
31 Age Sense.B 0.489
32 Pitch Sense.B 0.420
33 Light Sense.B 0.449
34 Weight Sense.B 0.293
35 Sense.A Sense.B 0.725
36 Clever Sense.B 0.538
37 Age Clever 0.010
38 Pitch Clever 0.246
39 Light Clever 0.469
40 Weight Clever 0.445
41 Sense.A Clever 0.647
42 Sense.B Clever 0.538
DescTools offers a couple of graphical ways to represent the same matrix of correlations:
PlotCorr(dfr_cor)
PlotWeb(dfr_cor, col=c(hred, hblue))
Adding a couple more of Spearman’s School data series:
Sex <- c('m','m','f','m','f','m','m','m','m','f','f','f'
,'f','f','m','m','m','f','f','m','m','m','m','f'
,'m','m','m','m','m','f','f','m','m','m','m')
Age.Years <- c(9,8,7,8,8,8,9,8,9,7,8,7
,9,5,5,10,7,7,6,9,7,8,6,6
,7,5,5,6,6,7,6,6,7,5,6)
Age.Months <- c(7,7,3,0,0,1,9,2,0,11,2,2
,11,8,7,2,0,1,9,6,10,2,7,10
,11,6,11,6,5,0,5,1,3,6,6)
Pitch <- c('6','6','16','24','26','35','35','35','38','42','48','48'
,'67','67','70','74','74','77','77','80','80','96','96','104'
,'112','112','120','120','>120','>120','>120','>120','>120','>120','>120')
Intellectual.Class <- factor(c(2,2,1,2,2,2,3,1,3,2,2,1
,3,NA,NA,2,NA,1,NA,3,1,2,NA,NA
,2,NA,NA,NA,NA,1,NA,NA,1,NA,NA)
, ordered = TRUE)
str(Intellectual.Class) # reverse this order?
## Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 2 2 2 3 1 3 2 ...
School.II <- data.frame(Sex, Age.Years, Age.Months
, Pitch, Intellectual.Class
, stringsAsFactors = FALSE)
str(School.II)
## 'data.frame': 35 obs. of 5 variables:
## $ Sex : chr "m" "m" "f" "m" ...
## $ Age.Years : num 9 8 7 8 8 8 9 8 9 7 ...
## $ Age.Months : num 7 7 3 0 0 1 9 2 0 11 ...
## $ Pitch : chr "6" "6" "16" "24" ...
## $ Intellectual.Class: Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 1 2 2 2 3 1 3 2 ...
Sex <- c('f','m','f','f','m','f','f','f'
,'m','m','f','f','f','f','f','m'
,'m','f','m','m','m','m','f','m')
Age.Years <- c(10,12,11,10,13,12,10,9
,12,10,11,10,9,11,11,10
,10,10,10,10,12,10,13)
Age.Months <- c(9,4,1,11,7,6,4,5
,0,2,2,1,8,10,1,6
,8,4,4,1,3,7,3)
Discriminative.Pitch <- c('50','3','10','>60','4','2','4','20
','11','11','24','5','3','5','6','7
','15','11','14','15','7','4','>60')
Discriminative.Light <- c(10,10,10,10,12,10,10,10
,10,12,14,18,18,13,13,14
,19,14,13,13,19,16,19)
Discriminative.Weight <- c(4,6,9,9,5,10,11,11
,12,11,10,7,9,13,13,11
,10,13,18,28,13,16,27)
Rank.Classics <- c(16,5,13,22,1,4,12,23
,8,3,7,20,10,2,11,17
,21,19,18,15,9,14,6)
Rank.French <- c(19,6,11,23,1,2,14,22
,8,5,7,15,13,3,12,18
,20,21,16,10,9,17,4)
Rank.English <- c(10,6,11,22,1,2,13,23
,15,4,7,18,14,3,12,17
,21,9,8,20,16,19,5)
Rank.Mathem. <- c(7,5,13,22,2,1,18,23
,15,4,6,16,12,3,9,13
,19,21,17,10,11,20,8)
School.III <- data.frame(Age.Years, Age.Months
, Discriminative.Pitch
, Discriminative.Light
, Discriminative.Weight
, Rank.Classics
, Rank.French
, Rank.English
, Rank.Mathem.
, stringsAsFactors = FALSE)
str(School.III)
## 'data.frame': 23 obs. of 9 variables:
## $ Age.Years : num 10 12 11 10 13 12 10 9 12 10 ...
## $ Age.Months : num 9 4 1 11 7 6 4 5 0 2 ...
## $ Discriminative.Pitch : chr "50" "3" "10" ">60" ...
## $ Discriminative.Light : num 10 10 10 10 12 10 10 10 10 12 ...
## $ Discriminative.Weight: num 4 6 9 9 5 10 11 11 12 11 ...
## $ Rank.Classics : num 16 5 13 22 1 4 12 23 8 3 ...
## $ Rank.French : num 19 6 11 23 1 2 14 22 8 5 ...
## $ Rank.English : num 10 6 11 22 1 2 13 23 15 4 ...
## $ Rank.Mathem. : num 7 5 13 22 2 1 18 23 15 4 ...
This is Spearman’s table of data, it’s quite complex, with nested columns for data series over time:
Anyone fancy entering the data, or reproducing Fancher’s results?
Fancher’s Table 3:
Fancher’s Table 4: