Wicherts (2016) argues that there are multiple problems with using Jensen’s method (the method of correlated vectors) on item-level data, as has been done by te Nijenhuis et al (many papers) and Kirkegaard (2015). One of the problems concerns the assumption that item-total Pearson correlations can be used as estimates of the ‘g-loadings’ of items. If we assume for simplicity a scale that consists entirely of common factor variance and error variance, then the Pearson correlation between the total score and an item depends on two values: 1) the item’s true relationship to the latent trait (its discrimination), and 2) the item’s pass rate.
The reason for the second is that a single item is a dichotomous score, so its variance depends on the pass rate (base rate). Consider a simple example where each item has a discrimination (in item response theory terms) of 1.0, but different pass rates:
#packages: dplyr (data_frame), ggplot2, purrr (map_dbl), psych (biserial, irt.fa),
#and the kirkegaard package, which supplies GG_scatter, df_reorder_columns,
#df_subset_by_pattern and string concatenation with +
library(dplyr); library(ggplot2); library(purrr); library(psych); library(kirkegaard)
#generate data
set.seed(1)
d = data_frame(
total_score = rnorm(1e3),
item_1 = as.numeric(total_score > qnorm(.50)), #threshold at 50th centile: pass rate .50
item_2 = as.numeric(total_score > qnorm(.60)), #60th centile: pass rate .40
item_3 = as.numeric(total_score > qnorm(.70)), #70th centile: pass rate .30
item_4 = as.numeric(total_score > qnorm(.80)), #80th centile: pass rate .20
item_5 = as.numeric(total_score > qnorm(.90)) #90th centile: pass rate .10
)
#observed Pearson correlations
cor(d)
## total_score item_1 item_2 item_3 item_4 item_5
## total_score 1.00 0.80 0.79 0.76 0.70 0.60
## item_1 0.80 1.00 0.84 0.69 0.52 0.36
## item_2 0.79 0.84 1.00 0.81 0.62 0.43
## item_3 0.76 0.69 0.81 1.00 0.77 0.53
## item_4 0.70 0.52 0.62 0.77 1.00 0.69
## item_5 0.60 0.36 0.43 0.53 0.69 1.00
#variances
var(d)
## total_score item_1 item_2 item_3 item_4 item_5
## total_score 1.07 0.412 0.400 0.361 0.292 0.193
## item_1 0.41 0.250 0.207 0.158 0.106 0.056
## item_2 0.40 0.207 0.240 0.183 0.123 0.065
## item_3 0.36 0.158 0.183 0.212 0.142 0.075
## item_4 0.29 0.106 0.123 0.142 0.163 0.086
## item_5 0.19 0.056 0.065 0.075 0.086 0.096
#density of the normal curve at the same z-scores
dnorm(qnorm(seq(.5, .9, .1)))
## [1] 0.40 0.39 0.35 0.28 0.18
Thus, we observe that the Pearson correlations fall in perfect accordance with the variances of the items, making this a special case of variance reduction (‘restriction of range’). Notice the relationship between the item variances and the densities of the normal curve at the same z values used to create the items. Given such a regularity, readers may wonder whether it is possible to estimate what the Pearson correlation would have been if the items were not dichotomized, and it is.
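Before giving the formula, a quick sketch on the toy data makes the two regularities explicit: the item variances are simply Bernoulli variances P(1 − P), and the covariance of a standard normal score with the indicator of exceeding a threshold equals the normal density at that threshold.
#pass rates implied by the thresholds above
P = 1 - seq(.5, .9, .1)
#Bernoulli variances: compare with the item diagonal of var(d)
P * (1 - P)
#densities at the thresholds: compare with the covariances of total_score with the items
dnorm(qnorm(seq(.5, .9, .1)))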
The formula is given by Hunter and Schmidt (2004; rearranged for readability):

$$r_{\text{true}} = r_{\text{observed}} \cdot \left( \frac{d(z)}{\sqrt{P \cdot Q}} \right)^{-1}$$

where $P$ is the proportion of cases above the threshold (the pass rate), $Q$ is the proportion below (so $Q = 1 - P$), and $d(z)$ is the density of the normal curve at the z score used to dichotomize the distribution. For example, for the first item, these values are:
$$r_{\text{true}} = .80 \cdot \left( \frac{.40}{\sqrt{.5 \cdot .5}} \right)^{-1} = .80 \cdot \left( \frac{.40}{.5} \right)^{-1} = .80 \cdot .80^{-1} = 1$$
and for the item with a pass rate of .10:
$$r_{\text{true}} = .60 \cdot \left( \frac{.18}{\sqrt{.1 \cdot .9}} \right)^{-1} = .60 \cdot \left( \frac{.18}{.30} \right)^{-1} = .60 \cdot .60^{-1} = 1$$
This way of estimating what a correlation would have been belongs to a special class of latent correlations called the biserial correlation (not to be confused with the point-biserial correlation).
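As a check, here is a minimal sketch applying the correction to all five toy items at once; the psych package’s biserial() function computes the same quantity directly from the data.
#thresholds used to create the items, and the implied pass rates
thresholds = seq(.5, .9, .1)
P = 1 - thresholds
#observed item-total Pearson correlations
r_observed = cor(d)[1, -1]
#Hunter & Schmidt correction: all values should be approximately 1
r_observed / (dnorm(qnorm(thresholds)) / sqrt(P * (1 - P)))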
Before trying the corrected Jensen’s method, let’s first see the problem Wicherts is talking about. To do that, we have to generate some data. We will use 100 items, although more commonly the 60 items of the Raven’s Standard Progressive Matrices have been used; using more items reduces indicator (psychometric) sampling error, which we are not interested in here. As before, we assume a single factor structure (à la Spearman’s original model) with only two sources of variance: g variance and error variance. We also let the items vary randomly in their pass rates and their discriminations.
#we generate the item data
set.seed(1)
d2_items = data_frame(
#id
id = 1:100,
#generate some item pass rates
pass_rate = pnorm(rnorm(100)),
#their z score equivalents = item difficulty in IRT
item_difficulty = qnorm(pass_rate, lower.tail = F),
#item discrimination
discrimination = runif(100, min = .1, max = 2)
)
#item col names
d2_item_names = "item_" + 1:nrow(d2_items)
#sample size
simul_n = 3e4
Having set up our item data, we generate 30000 cases from three groups with mean abilities of -.50, 0 and .50.
set.seed(1)
#generate groups
d2_data = data_frame(group = rep(LETTERS[1:3], length.out = simul_n))
#generate person latent ability levels
d2_data$latent_ability = NA
d2_data[d2_data$group == "A", "latent_ability"] = rnorm(nrow(d2_data)/3, mean = -.5)
d2_data[d2_data$group == "B", "latent_ability"] = rnorm(nrow(d2_data)/3, mean = 0)
d2_data[d2_data$group == "C", "latent_ability"] = rnorm(nrow(d2_data)/3, mean = .5)
#generate item scores
#for each item
set.seed(1)
for (item in 1:nrow(d2_items)) {
#linear predictor from the 2PN (two-parameter normal ogive) model
x = d2_items$discrimination[item] * (d2_data$latent_ability - d2_items$item_difficulty[item])
#probability of failing the item for each person (upper tail)
y = pnorm(x, lower.tail = F)
#roll the dice: a person passes when the uniform draw exceeds the failure probability
d2_data[["item_" + item]] = (runif(nrow(d2_data)) > y)
}
#total score (sum)
d2_data$total = rowSums(d2_data[-c(1:2)])
#plot
ggplot(d2_data, aes(total)) +
geom_density(fill = "black", alpha = .3) +
xlab("Total score")
ggplot(d2_data, aes(total, fill = group, color = group)) +
geom_density(alpha = .3) +
xlab("Total score")
#reorder
d2_data = df_reorder_columns(d2_data, c("group" = 1, "latent_ability" = 2, "total" = 3))
#empirical pass rates
d2_items$pass_rate_empirical = colMeans(d2_data[-(1:3)])
#inspect some Pearson correlations
cor(d2_data[2:10])
## latent_ability total item_1 item_2 item_3 item_4 item_5
## latent_ability 1.00 0.97 0.43 0.39 0.57 0.39 0.34
## total 0.97 1.00 0.44 0.40 0.57 0.40 0.36
## item_1 0.43 0.44 1.00 0.17 0.26 0.17 0.14
## item_2 0.39 0.40 0.17 1.00 0.22 0.15 0.14
## item_3 0.57 0.57 0.26 0.22 1.00 0.20 0.19
## item_4 0.39 0.40 0.17 0.15 0.20 1.00 0.13
## item_5 0.34 0.36 0.14 0.14 0.19 0.13 1.00
## item_6 0.57 0.57 0.27 0.22 0.38 0.20 0.19
## item_7 0.61 0.63 0.26 0.24 0.33 0.27 0.22
## item_6 item_7
## latent_ability 0.57 0.61
## total 0.57 0.63
## item_1 0.27 0.26
## item_2 0.22 0.24
## item_3 0.38 0.33
## item_4 0.20 0.27
## item_5 0.19 0.22
## item_6 1.00 0.33
## item_7 0.33 1.00
We have some fairly realistic-looking data. The A and C groups are somewhat skewed due to floor/ceiling problems. The sum (total score) of the 100 items correlates nearly perfectly (0.97) with the persons’ true latent ability. This is despite the fact that every item is a somewhat to very noisy measure; aggregating many noisy indicators averages out the error, as the Spearman-Brown formula predicts.
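As a rough sketch of that point, assuming approximately parallel items, the Spearman-Brown prophecy formula turns the mean inter-item correlation into the expected reliability of the 100-item sum, and the square root of that reliability approximates the correlation with the latent trait.
#mean inter-item Pearson correlation
item_cors = cor(d2_data[-(1:3)])
r_mean = mean(item_cors[lower.tri(item_cors)])
#Spearman-Brown prophecy for the 100-item sum
k = 100
r_total_predicted = (k * r_mean) / (1 + (k - 1) * r_mean)
sqrt(r_total_predicted) #compare with the observed correlation of .97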
To apply Jensen’s method the way it has been done by te Nijenhuis et al, we calculate the group difference on each item, which in this case we take to be the difference in pass rates. We also calculate the item-total Pearson correlations.
#calculate group item pass rates
d2_items$A_pass_rate = colMeans(d2_data[d2_data$group == "A", -c(1:3)])
d2_items$B_pass_rate = colMeans(d2_data[d2_data$group == "B", -c(1:3)])
d2_items$C_pass_rate = colMeans(d2_data[d2_data$group == "C", -c(1:3)])
#differences in rates
d2_items$A_B_difference = d2_items$B_pass_rate - d2_items$A_pass_rate
d2_items$A_C_difference = d2_items$C_pass_rate - d2_items$A_pass_rate
d2_items$B_C_difference = d2_items$C_pass_rate - d2_items$B_pass_rate
#item-total Pearson correlations
d2_items$item_total = cor(d2_data[-c(1:2)])[-1, 1]
With these vectors calculated, we can now plot them to apply Jensen’s method.
#plots
GG_scatter(d2_items, "item_total", "A_B_difference")
GG_scatter(d2_items, "item_total", "A_C_difference")
GG_scatter(d2_items, "item_total", "B_C_difference")
We can see the problem. We know the score differences between groups are entirely due to g, but the correlations are not quite 1.00 and we see some odd outlying points. There seems to be something influencing the relationship. What if we instead used the latent (biserial) correlations?
#calculate biserial item-total correlations
# d2_items$item_total_latent = hetcor(d2_data[3:ncol(d2_data)], std.err = F)$correlations[-1, 1]
d2_items$item_total_latent = map_dbl(df_subset_by_pattern(d2_data, pattern = "item_"), function(x) biserial(d2_data$total, x))
#plots
GG_scatter(d2_items, "item_total_latent", "A_B_difference")
GG_scatter(d2_items, "item_total_latent", "A_C_difference")
GG_scatter(d2_items, "item_total_latent", "B_C_difference")
Surprisingly, perhaps, these correlations are even odder than before. There’s clearly something else amiss.
What if we take the plunge and use item response theory factor analysis? Note that doing this analysis requires case-level data, or at least a matrix of latent correlations between items. This requirement seriously limits our ability to use older datasets, due to the lack of data sharing. How do the loadings from this analysis relate to the latent item-total correlations?
#IRT fa
irt_fa = irt.fa(d2_data[-(1:3)], plot = F, sort = F)
d2_items$irt_loadings = irt_fa$fa$loadings %>% as.vector
#plot
GG_scatter(d2_items, "item_total_latent", "irt_loadings")
They are almost the same. The very small discrepancies have to do with slight differences in the approximation/fitting formulas used by the functions, I suspect. So the culprit does not seem to be the loadings. Is it the pass rates? The blogger La Griffe du Lion has written in detail about the problem of using differences in pass rates as measures of group differences, especially when thresholds change over time; this is analogous to our case, where the thresholds differ by item. The reason it is problematic is that differences in pass rates are nonlinear transformations of the differences between the underlying normal distributions (z scores), as the snippet below illustrates. Fortunately, one can convert pass rates back to z scores. We have two choices of which difficulties to use: observed difficulties or those estimated from IRT.
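First, the promised minimal illustration of the nonlinearity, assuming unit-variance normal distributions: a constant latent gap of 0.5 SD produces very different pass-rate gaps depending on where the threshold sits.
#pass-rate gap implied by a 0.5 SD latent gap at three different thresholds
thresholds = c(0, 1, 2)
pnorm(.5 - thresholds) - pnorm(0 - thresholds) #the gap shrinks as the threshold moves into the tail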
#calculate the empirical item difficulties (z scores)
d2_items$item_difficulty_empirical = qnorm(d2_items$pass_rate_empirical, lower.tail = F)
#IRT difficulties
d2_items$item_difficulty_irt = irt_fa$irt$difficulty[[1]]
#how do they relate?
GG_scatter(d2_items, "item_difficulty_empirical", "item_difficulty_irt")
#calculate group mean scores (z scores) for each item
d2_items$A_mean_z = qnorm(d2_items$A_pass_rate, mean = d2_items$item_difficulty_empirical)
d2_items$B_mean_z = qnorm(d2_items$B_pass_rate, mean = d2_items$item_difficulty_empirical)
d2_items$C_mean_z = qnorm(d2_items$C_pass_rate, mean = d2_items$item_difficulty_empirical)
#do the values seem realistic?
head(d2_items[c("A_mean_z", "B_mean_z", "C_mean_z")], 10)
## # A tibble: 10 × 3
## A_mean_z B_mean_z C_mean_z
## <dbl> <dbl> <dbl>
## 1 -0.27 -0.0058 0.25
## 2 -0.23 0.0022 0.23
## 3 -0.40 -0.0432 0.36
## 4 -0.24 0.0242 0.27
## 5 -0.20 0.0131 0.20
## 6 -0.40 -0.0393 0.35
## 7 -0.36 0.0280 0.38
## 8 -0.16 0.0022 0.17
## 9 -0.25 0.0153 0.26
## 10 -0.42 -0.0367 0.42
The naive empirical difficulties agree well with those from IRT. Regarding the group differences by item: what kind of values did we expect to see? Remember that the real latent group means are -.50, 0 and .50, so we expect the per-item values to be smaller in magnitude than these numbers, attenuated by the items’ imperfect discrimination. This is indeed what we see. Using these z-score means, we can calculate new group difference vectors and re-do Jensen’s method.
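Before that, a quick sanity check of the attenuation claim (a sketch based on the simulation’s own model): under the 2PN model used to generate the data, a group with latent mean $\mu$ passes an item with discrimination $a$ and difficulty $b$ at the rate $\Phi(a(\mu - b)/\sqrt{1 + a^2})$, so on the item’s z scale a latent gap shrinks by the factor $a/\sqrt{1 + a^2}$.
#expected per-item attenuation factor
attenuation = with(d2_items, discrimination / sqrt(1 + discrimination^2))
#the recovered A-C gaps should be roughly 1 SD (the A-C latent gap) times this factor
cor(attenuation, d2_items$C_mean_z - d2_items$A_mean_z) #should be near 1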
#new group difference vectors
d2_items$A_B_difference_z = d2_items$B_mean_z - d2_items$A_mean_z
d2_items$A_C_difference_z = d2_items$C_mean_z - d2_items$A_mean_z
d2_items$B_C_difference_z = d2_items$C_mean_z - d2_items$B_mean_z
#Jensen's method
GG_scatter(d2_items, "item_total_latent", "A_B_difference_z")
GG_scatter(d2_items, "item_total_latent", "A_C_difference_z")
GG_scatter(d2_items, "item_total_latent", "B_C_difference_z")
The correlations are now very close to 1.0, the expected value. The consistently outlying item 14 is due to its very low pass rate (0.03), which causes statistical instability in the results.
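One can inspect the offending item directly:
#look at the unstable item
d2_items[14, c("pass_rate_empirical", "item_total_latent", "A_C_difference_z")]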
Wicherts and other researchers prefer more sophisticated methods for examining to what degree group differences in IQ/total scores are due to g and to what degree they are due to other things, including other ability factors and test bias. In an exemplary study, Frisby and Beaujean (2015) used both Jensen’s method and MGCFA (multi-group confirmatory factor analysis) to examine the same dataset. We will do the same here. (NOT COMPLETED YET!)