Classical Test Theory (CTT)

CTT (also known as weak theory or true-score theory) aims to predict outcomes of testing, such as item difficulty and the ability of test-takers, in order to improve the reliability of the test itself.

CTT assumes that each person has a true score that would be obtained if there were no errors in measurement. Further, this approach to psychometrics uses a single, common estimate of measurement precision that is assumed to apply to all test-takers, regardless of their individual attribute levels.
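
In symbols, the true-score model underlying this assumption can be written (a standard textbook formulation, stated here for reference) as

\[ X = T + E, \qquad \operatorname{E}(E) = 0, \qquad \rho_{TE} = 0, \]

where \(X\) is the observed score, \(T\) the true score, and \(E\) random measurement error assumed to be uncorrelated with the true score.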

The major advantage of CTT is its relatively weak theoretical assumptions, which make it easy to apply in many testing situations. These relatively weak assumptions characterize not only CTT itself but also its extensions (e.g., generalizability theory).

setwd("D:/Class Materials & Work/Summer 2020 practice/CTT")

library(psych)
library(CTT)
library(tidyverse)

Test Scoring and Descriptive Statistics

First, we will convert the raw test responses into a dichotomous (0/1) data set by scoring them with score().

#Importing test data
data <- read_csv("sample.score.csv", col_names = T)

#Importing test key
key <- read_csv("sample.key.csv", col_names = T) 
key <- as.matrix(key)

#scoring
myScore <- score(data, key, output.scored=TRUE)

#Descriptive
describe(myScore$score)
##    vars   n mean sd median trimmed  mad min max range skew kurtosis   se
## X1    1 241 9.44  3      9    9.34 2.97   1  19    18 0.26    -0.14 0.19
#Sum score
sum_score <- as.data.frame(myScore$score)

We can also transform the scores onto a new scale by specifying a new mean and SD, and optionally normalizing, with score.transform(). Here we will use the IQ metric, with mean = 100 and SD = 15.

One can likewise transform the metric into a Z score (mean = 0, SD = 1), a T score (mean = 50, SD = 10), or even a stanine score (mean = 5, SD = 2); a sketch of these calls follows the IQ example below.

#Score transformation
IQ <- score.transform(myScore$score, mu.new = 100, sd.new = 15, normalize = T)
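
The other metrics mentioned above can be obtained with the same score.transform() call; this is a minimal sketch, and the object names (Tscore, Zscore) are arbitrary.

#T score (mean = 50, SD = 10)
Tscore <- score.transform(myScore$score, mu.new = 50, sd.new = 10, normalize = T)

#Z score (mean = 0, SD = 1)
Zscore <- score.transform(myScore$score, mu.new = 0, sd.new = 1, normalize = T)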

Item Characteristic Curve of Item 1

cttICC(score = myScore$score, itemVector = myScore$scored[,1], colTheme="spartans", cex=1.5)

We will convert the scored responses into a matrix for further use.

#extract responses only
responses <- as.matrix(myScore$scored) 

dimnames(responses) <- NULL

#Assign dimensions
(N <- dim(responses)[2]) # number of items
## [1] 20
(K <- dim(responses)[1]) # number of test-takers
## [1] 241

1.1 Reliability Analysis

Internal Consistency

To start, we will compute the internal consistency of the test, covering Cronbach's alpha, KR-20, and KR-21.

For Cronbach’s Alpha:
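
Both the psych::alpha() call and the hand-written cronbachs.alpha() function below implement the usual covariance-based formula, stated here for reference:

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right), \]

where \(k\) is the number of items, \(\sigma^2_i\) the variance of item \(i\), and \(\sigma^2_X\) the variance of the total scores.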

#Cronbach Alpha and Guttman's Lambda 6

#With psych package
psych::alpha(myScore$scored, check.keys = T)
## Warning in psych::alpha(myScore$scored, check.keys = T): Some items were negatively correlated with total scale and were automatically reversed.
##  This is indicated by a negative sign for the variable name.
## 
## Reliability analysis   
## Call: psych::alpha(x = myScore$scored, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
##       0.55      0.55    0.58     0.057 1.2 0.041 0.48 0.15    0.045
## 
##  lower alpha upper     95% confidence boundaries
## 0.47 0.55 0.63 
## 
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## V1        0.55      0.54    0.58     0.059 1.18    0.042 0.0081 0.045
## V2        0.56      0.55    0.58     0.060 1.21    0.041 0.0079 0.048
## V3        0.56      0.55    0.59     0.060 1.22    0.041 0.0080 0.048
## V4        0.56      0.56    0.59     0.062 1.25    0.040 0.0080 0.052
## V5        0.55      0.54    0.58     0.059 1.20    0.041 0.0081 0.048
## V6        0.51      0.50    0.54     0.050 1.00    0.046 0.0071 0.041
## V7        0.56      0.55    0.59     0.062 1.25    0.040 0.0078 0.051
## V8        0.55      0.54    0.58     0.059 1.19    0.041 0.0080 0.048
## V9        0.55      0.54    0.58     0.058 1.17    0.042 0.0080 0.048
## V10-      0.56      0.56    0.59     0.062 1.25    0.040 0.0079 0.052
## V11       0.54      0.54    0.57     0.057 1.15    0.042 0.0081 0.045
## V12       0.55      0.54    0.58     0.058 1.17    0.042 0.0080 0.044
## V13       0.53      0.52    0.56     0.055 1.10    0.043 0.0074 0.044
## V14       0.53      0.52    0.56     0.055 1.10    0.043 0.0076 0.043
## V15       0.54      0.53    0.57     0.057 1.14    0.042 0.0076 0.044
## V16       0.51      0.51    0.55     0.051 1.03    0.045 0.0068 0.044
## V17       0.49      0.48    0.52     0.047 0.93    0.048 0.0056 0.039
## V18       0.51      0.50    0.53     0.050 1.00    0.046 0.0060 0.044
## V19       0.55      0.54    0.58     0.059 1.19    0.042 0.0079 0.045
## V20       0.55      0.55    0.58     0.060 1.20    0.041 0.0077 0.048
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean   sd
## V1   241  0.26  0.28 0.181  0.125 0.76 0.43
## V2   241  0.23  0.24 0.127  0.083 0.73 0.44
## V3   241  0.22  0.23 0.115  0.082 0.26 0.44
## V4   241  0.20  0.20 0.066  0.043 0.68 0.47
## V5   241  0.26  0.26 0.152  0.106 0.37 0.48
## V6   241  0.51  0.50 0.497  0.370 0.54 0.50
## V7   241  0.19  0.20 0.074  0.045 0.28 0.45
## V8   241  0.27  0.27 0.163  0.113 0.48 0.50
## V9   241  0.26  0.29 0.199  0.143 0.83 0.38
## V10- 241  0.21  0.20 0.067  0.043 0.56 0.50
## V11  241  0.32  0.32 0.228  0.169 0.34 0.47
## V12  241  0.31  0.30 0.205  0.148 0.56 0.50
## V13  241  0.39  0.38 0.323  0.240 0.39 0.49
## V14  241  0.40  0.38 0.327  0.247 0.45 0.50
## V15  241  0.33  0.33 0.254  0.183 0.31 0.46
## V16  241  0.47  0.47 0.452  0.337 0.64 0.48
## V17  241  0.60  0.59 0.647  0.478 0.44 0.50
## V18  241  0.52  0.50 0.526  0.387 0.33 0.47
## V19  241  0.25  0.27 0.176  0.123 0.19 0.39
## V20  241  0.26  0.25 0.152  0.097 0.44 0.50
## 
## Non missing response frequency for each item
##          0    1 miss
##  [1,] 0.24 0.76    0
##  [2,] 0.27 0.73    0
##  [3,] 0.74 0.26    0
##  [4,] 0.32 0.68    0
##  [5,] 0.63 0.37    0
##  [6,] 0.46 0.54    0
##  [7,] 0.72 0.28    0
##  [8,] 0.52 0.48    0
##  [9,] 0.17 0.83    0
## [10,] 0.56 0.44    0
## [11,] 0.66 0.34    0
## [12,] 0.44 0.56    0
## [13,] 0.61 0.39    0
## [14,] 0.55 0.45    0
## [15,] 0.69 0.31    0
## [16,] 0.36 0.64    0
## [17,] 0.56 0.44    0
## [18,] 0.67 0.33    0
## [19,] 0.81 0.19    0
## [20,] 0.56 0.44    0
#With a written function
cronbachs.alpha <-
  function(X)
    {
    
    X <- data.matrix(X)
    n <- ncol(X) # Number of items
    k <- nrow(X) # Number of examinees
    
    # Cronbachs alpha
    alpha <- (n/(n - 1))*(1 - sum(apply(X, 2, var))/var(rowSums(X)))
    
    return(list("Crombach's alpha" = alpha,
                "Number of items" = n,
                "Number of examinees" = k))
    }

#Dump the "cronbachs.alpha" function for further use
dump("cronbachs.alpha", file = "cronbachs.alpha.R")

#Compute Cronbach's alpha
cronbachs.alpha(responses)
## $`Cronbach's alpha`
## [1] 0.5395239
## 
## $`Number of items`
## [1] 20
## 
## $`Number of examinees`
## [1] 241
#Compare
CTT::reliability(responses)
## You will find additional options and better formatting using itemAnalysis().
## 
##  Number of Items 
##  20 
## 
##  Number of Examinees 
##  241 
## 
##  Coefficient Alpha 
##  0.54

Next, we compute the Kuder-Richardson formula 20 (KR-20) and the Kuder-Richardson formula 21 (KR-21) with hand-written functions.
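
For reference, the two formulas implemented by the functions below are

\[ KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i (1 - p_i)}{\sigma^2_X}\right), \qquad KR_{21} = \frac{k}{k-1}\left(1 - \frac{k\,\bar{p}(1 - \bar{p})}{\sigma^2_X}\right), \]

where \(p_i\) is the proportion answering item \(i\) correctly, \(\bar{p}\) is the mean of the \(p_i\), \(k\) is the number of items, and \(\sigma^2_X\) is the variance of the total scores.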

# formula 20
KR20 <-
  function(X)
    {
    X <- data.matrix(X)
    k <- ncol(X)
    
    # Person total score variances
    SX <- var(rowSums(X))
    
    # item means
    IM <- colMeans(X)
    
    return(((k/(k - 1))*((SX - sum(IM*(1 - IM)))/SX)))
  }

KR20(responses)
## [1] 0.541653
# formula 21

KR21 <-
  function(X)
    {
    X <- data.matrix(X)
    n <- ncol(X)
    
    return((n/(n-1))*((var(rowSums(X)) - n*(sum(colMeans(X))/n) * 
                         (1-(sum(colMeans(X))/n))))/var(rowSums(X)))
  }

KR21(responses)
## [1] 0.4700428

Split-Half Reliability

Split-half reliability estimates reliability from the correlation between two halves of the test that are treated as equivalent forms. The correlation between the split halves is a reasonable measure of the reliability of one half of the test; the reliability of the entire test will be greater than that of either half taken alone, as the step-up formula below shows.
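
The usual way to step up from the half-test correlation \(r_{hh}\) to the full-length reliability is the two-half special case of the Spearman-Brown formula (treated more generally below):

\[ r_{xx'} = \frac{2\, r_{hh}}{1 + r_{hh}} . \]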

#Split the cases (rows/examinees) into two halves, randomly and (nearly) equally.

SPLIT.CASES <- 
  function(X, seed = NULL)
    {
    # optional fixed seed
    if (!is.null(seed)) {set.seed(seed)} 
    
    X <- as.matrix(X)
    
    # if k = 2x, then lengths Y1 = Y2
    # if k = 2x+1, then lengths Y1 = Y2+1
    k <- nrow(X)
    index <- sample(1:k, ceiling(k/2))
    Y1 <- X[index, ]
    Y2 <- X[-index, ]
    return(list(Y1, Y2))
  }

#Split the items (columns/variables) into two halves, randomly and (nearly) equally.

SPLIT.ITEMS <- 
  function(X, seed = NULL)
    {
    # optional fixed seed
    if (!is.null(seed)) {set.seed(seed)} 
    
    X <- as.matrix(X)
    
    # if n = 2x, then lengths Y1 = Y2
    # if n = 2x+1, then lengths Y1 = Y2+1
    n <- ncol(X)
    index <- sample(1:n, ceiling(n/2))
    Y1 <- X[, index ]
    Y2 <- X[, -index]
    return(list(Y1, Y2)) 
  }

dump("SPLIT.ITEMS", file = "SPLIT.ITEMS.R")

#with psych package

splitHalf(responses, raw = T)
## Warning in splitHalf(responses, raw = T): Some items were negatively correlated
## with total scale and were automatically reversed.
## Split half reliabilities  
## Call: splitHalf(r = responses, raw = T)
## 
## Maximum split half reliability (lambda 4) =  0.69
## Guttman lambda 6                          =  0.58
## Average split half reliability            =  0.55
## Guttman lambda 3 (alpha)                  =  0.55
## Guttman lambda 2                          =  0.57
## Minimum split half reliability  (beta)    =  0.34
## Average interitem r =  0.06  with median =  0.04
##                                              2.5% 50% 97.5%
##  Quantiles of split half reliability      =  0.44 0.55 0.63

The Spearman-Brown formula predicts the reliability of a test after its length is changed.
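
With \(N = n_2 / n_1\) the factor by which the number of items changes and \(\rho\) the reliability of the original test, the formula implemented below is

\[ \rho_{new} = \frac{N \rho}{1 + (N - 1)\rho} . \]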

# Spearman-Brown formula
SpearmanBrown <- 
  function(x, n1, n2)
    {
    # n1 = original number of items, n2 = new number of items
    source("cronbachs.alpha.R")
    
    x <- as.matrix(x)
    N <- n2/n1 # factor by which the test length changes
    
    # Cronbach's alpha for the original test
    alpha <- cronbachs.alpha(x)[[1]]
    predicted.alpha <- N * alpha / (1 + (N - 1) * alpha)
    
    return(list(original.reliability = alpha,
                original.n.items = n1,
                predicted.reliability = predicted.alpha,
                predicted.n.items = n2))
  }

# predict reliability by Spearman-Brown formula
# if the number of items is reduced from 20 to 15
SpearmanBrown(responses, n1 = 20, n2 = 15)
## $original.reliability
## [1] 0.5395239
## 
## $original.n.items
## [1] 20
## 
## $predicted.reliability
## [1] 0.4677309
## 
## $predicted.n.items
## [1] 15
# predict reliability by Spearman-Brown formula
# if the number of items is increased from 20 to 35
SpearmanBrown(responses, n1 = 20, n2 = 35)
## $original.reliability
## [1] 0.5395239
## 
## $original.n.items
## [1] 20
## 
## $predicted.reliability
## [1] 0.6721757
## 
## $predicted.n.items
## [1] 35
# predict reliability by Spearman-Brown formula
# if the number of items is doubled
SpearmanBrown(responses, n1 = 1, n2 = 2)
## $original.reliability
## [1] 0.5395239
## 
## $original.n.items
## [1] 1
## 
## $predicted.reliability
## [1] 0.7008971
## 
## $predicted.n.items
## [1] 2

Guttman's Lambda 3 (Guttman 1945)

This index is algebraically equivalent to Cronbach's alpha.
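
The G3() function below first computes \(\lambda_1\) and then rescales it, following Guttman's definitions:

\[ \lambda_1 = 1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}, \qquad \lambda_3 = \frac{k}{k-1}\,\lambda_1 . \]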

G3 <-
  function(responses)
    {
    
    X <- as.matrix(responses)
    k <- ncol(X) # number of items
    
    SX <- var(rowSums(X)) # variance of total person scores
    SI <- apply(X, 2, var) # item variances
    L1 <- (1 - (sum(SI)/SX)) # Guttman's lambda 1
    
    return((k*L1)/(k - 1))
    }

# compute G3
G3(responses)
## [1] 0.5395239
#Compare
psych::guttman(responses)
## Call: psych::guttman(r = responses)
## 
## Alternative estimates of reliability
## 
## Guttman bounds 
## L1 =  0.51 
## L2 =  0.56 
## L3 (alpha) =  0.53 
## L4 (max) =  0.69 
## L5 =  0.55 
## L6 (smc) =  0.57 
## TenBerge bounds 
## mu0 =  0.53 mu1 =  0.56 mu2 =  0.56 mu3 =  0.56 
## 
## alpha of first PC =  0.64 
## estimated greatest lower bound based upon communalities=  0.69 
## 
## beta found by splitHalf  =  0.34

Pearson product-moment correlation coefficient

We will repeatedly split the items at random into two halves (1,000 replicates by default) and compute the Pearson product-moment correlation coefficient between the two half-test total scores, returning the average correlation across replicates.
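
Inside the loop, the correlation is computed from mean-centered (residual) half-test scores, which is just the usual Pearson formula:

\[ r = \frac{\sum_j (S_{1j} - \bar{S}_1)(S_{2j} - \bar{S}_2)}{\sqrt{\sum_j (S_{1j} - \bar{S}_1)^2}\,\sqrt{\sum_j (S_{2j} - \bar{S}_2)^2}} . \]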

pearson <- 
  function(X, seed = NULL, n = NULL)
    {
    source("SPLIT.ITEMS.R")
    
    # optional fixed seed
    if (!is.null(seed)) {set.seed(seed)}
    
    # the number of bootstrap replicates
    if (is.null(n)) {n <- 1e3}   
    
    X <- as.matrix(X)
    r <- rep(NA, n)
    
    for (i in 1:n) {
      # split items
      Y <- SPLIT.ITEMS(X)
      
      # total scores
      S1 <- as.matrix(rowSums(Y[[1]]))
      S2 <- as.matrix(rowSums(Y[[2]]))
      
      # residual scores
      R1 <- S1 - mean(S1)
      R2 <- S2 - mean(S2)
      
      # Pearson product-moment correlation coefficient
      r[i] <- (t(R1) %*% R2) / (sqrt((t(R1) %*% R1)) * sqrt((t(R2) %*% R2)))
    }
    
    return(mean(r))
  }

# compute the Pearson product-moment correlation coefficient
pearson(responses, seed = 456, n = 1)
## [1] 0.3499066
# compare
# split items
set.seed(456)
Y <- SPLIT.ITEMS(responses)

# total scores
S1 <- as.matrix(rowSums(Y[[1]]))
S2 <- as.matrix(rowSums(Y[[2]]))

cor(S1, S2)
##           [,1]
## [1,] 0.3499066

All in all, when item scores are not essentially tau-equivalent (for example, when the test is not homogeneous but consists of items of increasing difficulty), KR-20 provides a lower bound for internal consistency (reliability).

KR-21 typically underestimates the reliability of a test compared with KR-20 when the item difficulties are not equal; the two formulas agree only when all item difficulties are equal.

Cronbach's \(\alpha\) and the Kuder-Richardson methods (Kuder and Richardson 1937) produce a lower bound for a test's reliability, which equals the test reliability if the split halves are essentially \(\tau\)-equivalent. These coefficients should only be used for homogeneous tests, since they reflect item homogeneity; otherwise they will be inappropriately low.

A high KR-20 coefficient (e.g., > 0.90) provides evidence for the assumption of a homogeneous or unidimensional test. However, it is possible to obtain a high KR-20 with a multidimensional scale, especially when the number of test items is large.

The Spearman-Brown coefficient can over- or underestimate the test reliability if the split halves are not parallel tests. When the halves are parallel, the Spearman-Brown formula is useful for examining the effect of test length on reliability.

A test can produce different reliability estimates when administered to different samples of examinees. Generalizability theory examines such systematic effects on reliability.

1.2 Standard Error of Measurement

The standard error of measurement (SEM) quantifies how much observed test scores are expected to spread around a person's "true" score. Using Cronbach's alpha as the reliability estimate, it can be computed from the standard deviation of the observed total scores, as shown below.
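
In formula form (the quantity returned by the SEM() function below):

\[ SEM = \sigma_X \sqrt{1 - \rho_{xx'}}, \]

where \(\sigma_X\) is the standard deviation of the observed total scores and \(\rho_{xx'}\) is the reliability estimate (here, Cronbach's \(\alpha\)).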

SEM <-
  function(X){
    source("cronbachs.alpha.R")
    X <- data.matrix(X)
    
    return(sd(rowSums(X)) * sqrt(1 - cronbachs.alpha(X)[[1]]))
  }

SEM(responses)
## [1] 2.036401

As a result, an approximate 68% confidence interval for a person's true score is the observed score ± 2.04 points (one SEM on either side).

1.3 Confidence Intervals for True Scores

Note that the critical value (z) for a 90% confidence interval is approximately 1.65.
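
The interval computed below is simply the observed score plus or minus the critical value times the SEM, here based on KR-20:

\[ X \pm z_{crit}\, \sigma_X \sqrt{1 - KR_{20}} . \]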

# 90% confidence interval for the true score
head(cbind(lower_bound = round(rowSums(responses)-1.65* sd(rowSums(responses))*
                                 sqrt(1-KR20(responses)), 2), observed = rowSums(responses),
           upper_bound = round(rowSums(responses)+1.65* sd(rowSums(responses))*
                                 sqrt(1-KR20(responses)), 2)), 20)
##       lower_bound observed upper_bound
##  [1,]        5.65        9       12.35
##  [2,]        2.65        6        9.35
##  [3,]        9.65       13       16.35
##  [4,]        5.65        9       12.35
##  [5,]       10.65       14       17.35
##  [6,]        8.65       12       15.35
##  [7,]        3.65        7       10.35
##  [8,]        5.65        9       12.35
##  [9,]        4.65        8       11.35
## [10,]        7.65       11       14.35
## [11,]        2.65        6        9.35
## [12,]        4.65        8       11.35
## [13,]        1.65        5        8.35
## [14,]        9.65       13       16.35
## [15,]        4.65        8       11.35
## [16,]        1.65        5        8.35
## [17,]       11.65       15       18.35
## [18,]        4.65        8       11.35
## [19,]        6.65       10       13.35
## [20,]        9.65       13       16.35

1.4 Item Analysis

Item analysis within the classical approach often relies on two statistics for evaluating single items: the P-value and the point-biserial correlation coefficient.

The P-value represents the proportion of examinees responding in the keyed direction, and is typically referred to as item difficulty.

The point-biserial correlation coefficient is the correlation between an item's score and the total score on the remaining items (a corrected item-total correlation). It provides an index of the differentiating power of the item, which is typically referred to as item discrimination.

item.analysis <- 
  function(responses)
    {
    # CRITICAL VALUES
    cvpb = 0.20
    cvdl = 0.15
    cvdu = 0.85
    
    require(CTT, warn.conflicts = FALSE, quietly = TRUE)
    (ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE))
    
    # Mark items that are potentially problematic
    item.analysis <- data.frame(item = seq(1:ctt.analysis$nItem),
                                r.pbis = ctt.analysis$pBis,
                                bis = ctt.analysis$bis,
                                item.mean = ctt.analysis$itemMean,
                                alpha.del = ctt.analysis$alphaIfDeleted)
    
    # code provided by Dr. Gordon Brooks
    item.analysis$check <- 
      ifelse(item.analysis$r.pbis < cvpb |
               item.analysis$item.mean < cvdl |
               item.analysis$item.mean > cvdu, "/", "")
    
    return(item.analysis)
  }

knitr::kable(item.analysis(responses), 
             align = "c",
             caption = "Item Analysis")
## You will find additional options and better formatting using itemAnalysis().
Item Analysis
item r.pbis bis item.mean alpha.del check
1 0.1147596 0.1683232 0.7593361 0.5353385 /
2 0.1035530 0.1409292 0.7344398 0.5372914 /
3 0.0506175 0.0685887 0.2572614 0.5452346 /
4 0.0591143 0.0785356 0.6804979 0.5450389 /
5 0.1111202 0.1406669 0.3651452 0.5369402 /
6 0.3656502 0.4640379 0.5394191 0.4909063
7 0.0700444 0.0920897 0.2780083 0.5426513 /
8 0.1112357 0.1394688 0.4813278 0.5373705 /
9 0.1515853 0.2230714 0.8298755 0.5299043 /
10 -0.0434185 -0.0549382 0.4356846 0.5634988 /
11 0.1696824 0.2157830 0.3360996 0.5269375 /
12 0.1279600 0.1627060 0.5601660 0.5343645 /
13 0.2312726 0.2933878 0.3900415 0.5161192
14 0.2134979 0.2656683 0.4481328 0.5191567
15 0.1947141 0.2547864 0.3112033 0.5227779 /
16 0.3353941 0.4428850 0.6390041 0.4977310
17 0.4609036 0.5799529 0.4356846 0.4727918
18 0.4013261 0.5080745 0.3319502 0.4865119
19 0.0956187 0.1323704 0.1908714 0.5375545 /
20 0.1054536 0.1323520 0.4356846 0.5382775 /
#Compare with CTT package

item <- as.data.frame(myScore$scored)

Analyze <- itemAnalysis(item, itemReport=TRUE, NA.Delete=TRUE, pBisFlag = T,  bisFlag = T, flagStyle = c("X",""))

str(Analyze)
## List of 6
##  $ nItem     : int 20
##  $ nPerson   : int 241
##  $ alpha     : num 0.54
##  $ scaleMean : num 9.44
##  $ scaleSD   : num 3
##  $ itemReport:'data.frame':  20 obs. of  7 variables:
##   ..$ itemName      : chr [1:20] "V1" "V2" "V3" "V4" ...
##   ..$ itemMean      : num [1:20] 0.759 0.734 0.257 0.68 0.365 ...
##   ..$ pBis          : num [1:20] 0.1148 0.1036 0.0506 0.0591 0.1111 ...
##   ..$ bis           : num [1:20] 0.1576 0.1395 0.0686 0.0771 0.1423 ...
##   ..$ alphaIfDeleted: num [1:20] 0.535 0.537 0.545 0.545 0.537 ...
##   ..$ lowPBis       : chr [1:20] "X" "X" "X" "X" ...
##   ..$ lowBis        : chr [1:20] "X" "X" "X" "X" ...
##  - attr(*, "class")= chr "itemAnalysis"
knitr::kable(Analyze$itemReport, 
             align = "c",
             caption = "Item Analysis")
Item Analysis
itemName itemMean pBis bis alphaIfDeleted lowPBis lowBis
V1 0.7593361 0.1147596 0.1575704 0.5353385 X X
V2 0.7344398 0.1035530 0.1394722 0.5372914 X X
V3 0.2572614 0.0506175 0.0685887 0.5452346 X X
V4 0.6804979 0.0591143 0.0771286 0.5450389 X X
V5 0.3651452 0.1111202 0.1423179 0.5369402 X X
V6 0.5394191 0.3656502 0.4590911 0.4909063 X X
V7 0.2780083 0.0700444 0.0935474 0.5426513 X X
V8 0.4813278 0.1112357 0.1394688 0.5373705 X X
V9 0.8298755 0.1515853 0.2249732 0.5299043 X X
V10 0.4356846 -0.0434185 -0.0546771 0.5634988 X X
V11 0.3360996 0.1696824 0.2197305 0.5269375 X X
V12 0.5601660 0.1279600 0.1610437 0.5343645 X X
V13 0.3900415 0.2312726 0.2940006 0.5161192 X X
V14 0.4481328 0.2134979 0.2684080 0.5191567 X X
V15 0.3112033 0.1947141 0.2551016 0.5227779 X X
V16 0.6390041 0.3353941 0.4301674 0.4977310 X X
V17 0.4356846 0.4609036 0.5804172 0.4727918 X X
V18 0.3319502 0.4013261 0.5206309 0.4865119 X X
V19 0.1908714 0.0956187 0.1380853 0.5375545 X X
V20 0.4356846 0.1054536 0.1327980 0.5382775 X X

Item Difficulty

The item difficulty index is the proportion of examinees who answer a particular test item correctly. It is useful for assessing whether an item is appropriate for the level of the examinees taking the test. The desirable range of the item difficulty index is roughly 0.3 to 0.7; values close to 0 or 1 offer little information about examinees' level of the construct. A quick base-R cross-check of these values is sketched after the item.exam() call below.

However, items with extreme difficulty values may still be appropriate for instruments designed to measure extreme groups.

library(psychometric)

Item_Difficulty <- item.exam(x = responses, y = NULL, discrim = T)

rmarkdown::paged_table(Item_Difficulty)
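
As a cross-check that does not depend on the psychometric package, here is a minimal base-R sketch: for dichotomous items, difficulty is just the column mean of the 0/1 response matrix (the 0.3/0.7 cut-offs below are the conventional values mentioned above).

#Item difficulty as proportion correct (base R cross-check)
difficulty <- colMeans(responses)
round(difficulty, 2)

#Items outside the conventional 0.3-0.7 range
which(difficulty < 0.3 | difficulty > 0.7)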

Item Discrimination

The item discrimination index is a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not, or between masters and non-masters.

The possible range of the discrimination index is -1.0 to 1.0; however, if an item has a discrimination below 0.0, it suggests a problem. When an item is discriminating negatively, overall the most knowledgeable examinees are getting the item wrong and the least knowledgeable examinees are getting the item right. A negative discrimination index may indicate that the item is measuring something other than what the rest of the test is measuring. More often, it is a sign that the item has been mis-keyed.
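
Before turning to the point-biserial-based function below, a classical alternative worth knowing is the upper-lower discrimination index D: the proportion answering correctly in a high-scoring group minus the proportion in a low-scoring group. The sketch below (not part of the CTT package output) uses roughly the top and bottom 27% of examinees, a common choice.

#Upper-lower discrimination index (D)
total <- rowSums(responses)
upper <- responses[total >= quantile(total, 0.73), , drop = FALSE]
lower <- responses[total <= quantile(total, 0.27), , drop = FALSE]
D <- colMeans(upper) - colMeans(lower)
round(D, 2)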

item.discrimination <-
  function(responses)
    {
    # CRITICAL VALUES
    cvpb = 0.20
    cvdl = 0.15
    cvdu = 0.85
    
    require(CTT, warn.conflicts = FALSE, quietly = TRUE)
    ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE)
    
    item.discrimination <- data.frame(item = 1:ctt.analysis$nItem , 
                                      discrimination = ctt.analysis$pBis)
    
    plot(item.discrimination,
         type = "p",
         pch = 1,
         cex = 3,
         col = "purple",
         ylab = "Item-Total Correlation",
         xlab = "Item Number",
         ylim = c(0, 1),
         main = "Test Item Discriminations")
    
    abline(h = cvpb, col = "red")
    
    outlier <- data.matrix(subset(item.discrimination,
                                  subset = (item.discrimination[, 2] < cvpb)))
    
    text(outlier, paste("i", outlier[,1], sep = ""), col = "red", cex = .7)
    
    return(item.discrimination[order(item.discrimination$discrimination),])
  }

item.discrimination(responses)
## You will find additional options and better formatting using itemAnalysis().

##    item discrimination
## 10   10    -0.04341855
## 3     3     0.05061749
## 4     4     0.05911432
## 7     7     0.07004435
## 19   19     0.09561872
## 2     2     0.10355300
## 20   20     0.10545357
## 5     5     0.11112015
## 8     8     0.11123568
## 1     1     0.11475958
## 12   12     0.12795997
## 9     9     0.15158531
## 11   11     0.16968243
## 15   15     0.19471413
## 14   14     0.21349786
## 13   13     0.23127262
## 16   16     0.33539410
## 6     6     0.36565023
## 18   18     0.40132606
## 17   17     0.46090362

Item-Total Correlation

The item-total correlation is the correlation between the question score (e.g., 0 or 1 for multiple choice) and the overall assessment score.

It is expected that if a participant gets a question correct they should, in general, have higher overall assessment scores than participants who get a question wrong.

The relationship between an item's score and the total score of the assessment can therefore be interpreted as item discrimination. Item-total correlations (often reported as point-biserial correlations) of 0.19 or below indicate poor item discrimination, values between 0.20 and 0.39 indicate good discrimination, and values of 0.40 and above indicate excellent discrimination.
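
A corrected item-total correlation (each item correlated with the total of the remaining items) can also be computed directly in base R. This is a sketch for reference; depending on its exact definition, the package's pBis column may or may not match it exactly.

#Corrected item-total correlation (item excluded from the total)
total <- rowSums(responses)
r.drop <- apply(responses, 2, function(item) cor(item, total - item))
round(r.drop, 2)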

test_item.total <-
  function(responses)
    {
    # CRITICAL VALUES
    cvpb = 0.20
    cvdl = 0.15
    cvdu = 0.85
    
    require(CTT, warn.conflicts = FALSE, quietly = TRUE)
    ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE)
    
    test_item.total <- data.frame(item = 1:ctt.analysis$nItem , 
                                  biserial = ctt.analysis$bis)
    
    plot(test_item.total,
         main = "Test Item-Total Correlation",
         type = "p",
         pch = 1,
         cex = 2.8,
         col = "purple",
         ylab = "Item-Total Correlation",
         xlab = "Item Number",
         ylim = c(0, 1),
         xlim = c(0, ctt.analysis$nItem))
    
    abline(h = cvpb, col = "red")
    
    outlier <- data.matrix(subset(test_item.total,
                                  subset = test_item.total[,2] < cvpb))
    
    text(outlier, paste("i", outlier[,1], sep = ""), col = "red", cex = .7)
    
    return(test_item.total[order(test_item.total$biserial),])
    }

test_item.total(responses)
## You will find additional options and better formatting using itemAnalysis().

##    item    biserial
## 10   10 -0.05493825
## 3     3  0.06858871
## 4     4  0.07853557
## 7     7  0.09208970
## 20   20  0.13235198
## 19   19  0.13237037
## 8     8  0.13946879
## 5     5  0.14066694
## 2     2  0.14092920
## 12   12  0.16270597
## 1     1  0.16832321
## 11   11  0.21578295
## 9     9  0.22307140
## 15   15  0.25478638
## 14   14  0.26566826
## 13   13  0.29338781
## 16   16  0.44288497
## 6     6  0.46403793
## 18   18  0.50807452
## 17   17  0.57995293

Distractor/Option Analysis

In distractor analysis, examinees are divided into ability levels (e.g., lower, middle, and upper) based on their total test score, and the proportions of examinees who mark each option are compared across the levels. (The call below uses four score groups: lower, mid50, mid75, and upper.)

In the lower ability level, we would expect to see a smaller proportion of examinees choosing the correct option and a larger proportion of them marking the incorrect options or distractors.

Ideally, good distractors would attract about the same proportion of examinees. Distractors that attract no examinees, or only a very small proportion relative to the other distractors, should be considered for revision.

In the higher ability level, we would expect to see that the majority of examinees choose the correct option. If a distractor is more appealing than the correct option to the higher ability level examinees, then it should be eliminated or revised.

distractorAnalysis(items = data, key = key, nGroups = 4, pTable = T)
## $i001
##   correct key   n       rspP       pBis     discrim      lower      mid50
## A       *   A 183 0.75933610  0.1147596  0.22334218 0.70769231 0.63636364
## B           B  16 0.06639004 -0.2888067 -0.13846154 0.13846154 0.06060606
## C           C  22 0.09128631 -0.1975788 -0.05968170 0.07692308 0.16666667
## D           D  20 0.08298755 -0.1841578 -0.02519894 0.07692308 0.13636364
##        mid75      upper
## A 0.78846154 0.93103448
## B 0.05769231 0.00000000
## C 0.09615385 0.01724138
## D 0.05769231 0.05172414
## 
## $i002
##   correct key   n       rspP        pBis     discrim      lower      mid50
## A           A   8 0.03319502 -0.01743924  0.01909814 0.01538462 0.01515152
## B           B  40 0.16597510 -0.28940212 -0.16180371 0.23076923 0.16666667
## C       *   C 177 0.73443983  0.10355300  0.26392573 0.61538462 0.74242424
## D           D  16 0.06639004 -0.28880672 -0.12122016 0.13846154 0.07575758
##        mid75      upper
## A 0.07692308 0.03448276
## B 0.19230769 0.06896552
## C 0.71153846 0.87931034
## D 0.01923077 0.01724138
## 
## $i003
##   correct key  n      rspP        pBis     discrim     lower     mid50
## A           A 61 0.2531120 -0.21045253 -0.05278515 0.2769231 0.2878788
## B           B 74 0.3070539 -0.32875754 -0.22387268 0.4307692 0.2878788
## C       *   C 62 0.2572614  0.05061749  0.16021220 0.1846154 0.1818182
## D           D 44 0.1825726 -0.04787523  0.11644562 0.1076923 0.2424242
##       mid75     upper
## A 0.2115385 0.2241379
## B 0.2884615 0.2068966
## C 0.3461538 0.3448276
## D 0.1538462 0.2241379
## 
## $i004
##   correct key   n       rspP        pBis     discrim      lower     mid50
## A           A  11 0.04564315 -0.16015191 -0.09045093 0.10769231 0.0000000
## B           B  24 0.09958506 -0.15640748 -0.04244032 0.07692308 0.1515152
## C           C  42 0.17427386 -0.28460760 -0.15437666 0.29230769 0.1363636
## D       *   D 164 0.68049793  0.05911432  0.28726790 0.52307692 0.7121212
##        mid75      upper
## A 0.05769231 0.01724138
## B 0.13461538 0.03448276
## C 0.11538462 0.13793103
## D 0.69230769 0.81034483
## 
## $i005
##   correct key  n       rspP       pBis     discrim      lower      mid50
## A           A 84 0.34854772 -0.2097597 -0.02997347 0.32307692 0.42424242
## B       *   B 88 0.36514523  0.1111202  0.27294430 0.26153846 0.22727273
## C           C 61 0.25311203 -0.3237027 -0.19681698 0.36923077 0.30303030
## D           D  8 0.03319502 -0.1706663 -0.04615385 0.04615385 0.04545455
##        mid75     upper
## A 0.34615385 0.2931034
## B 0.48076923 0.5344828
## C 0.13461538 0.1724138
## D 0.03846154 0.0000000
## 
## $i006
##   correct key   n       rspP       pBis    discrim     lower      mid50
## A           A  61 0.25311203 -0.3587149 -0.2503979 0.3538462 0.34848485
## B       *   B 130 0.53941909  0.3656502  0.6466844 0.2153846 0.45454545
## C           C  34 0.14107884 -0.3800902 -0.2615385 0.2615385 0.16666667
## D           D  16 0.06639004 -0.2994444 -0.1347480 0.1692308 0.03030303
##        mid75      upper
## A 0.17307692 0.10344828
## B 0.69230769 0.86206897
## C 0.11538462 0.00000000
## D 0.01923077 0.03448276
## 
## $i007
##   correct key  n      rspP        pBis      discrim     lower     mid50
## A       *   A 67 0.2780083  0.07004435  0.211936340 0.1846154 0.2424242
## B           B 63 0.2614108 -0.17772394  0.027851459 0.2307692 0.3333333
## C           C 47 0.1950207 -0.13004489  0.003183024 0.1692308 0.1818182
## D           D 64 0.2655602 -0.32387947 -0.242970822 0.4153846 0.2424242
##       mid75     upper
## A 0.3076923 0.3965517
## B 0.2115385 0.2586207
## C 0.2692308 0.1724138
## D 0.2115385 0.1724138
## 
## $i008
##   correct key   n       rspP        pBis     discrim      lower      mid50
## A           A  61 0.25311203 -0.45040855 -0.34641910 0.41538462 0.34848485
## B           B  53 0.21991701 -0.11191555  0.01061008 0.23076923 0.16666667
## C       *   C 116 0.48132780  0.11123568  0.36286472 0.29230769 0.42424242
## D           D  11 0.04564315 -0.08820493 -0.02705570 0.06153846 0.06060606
##        mid75      upper
## A 0.13461538 0.06896552
## B 0.25000000 0.24137931
## C 0.59615385 0.65517241
## D 0.01923077 0.03448276
## 
## $i009
##   correct key   n       rspP       pBis     discrim      lower      mid50
## A           A   8 0.03319502 -0.1554692 -0.04244032 0.07692308 0.01515152
## B       *   B 200 0.82987552  0.1515853  0.28488064 0.64615385 0.87878788
## C           C  19 0.07883817 -0.2169620 -0.15013263 0.18461538 0.01515152
## D           D  14 0.05809129 -0.2866682 -0.09230769 0.09230769 0.09090909
##        mid75      upper
## A 0.00000000 0.03448276
## B 0.88461538 0.93103448
## C 0.07692308 0.03448276
## D 0.03846154 0.00000000
## 
## $i010
##   correct key   n       rspP        pBis     discrim     lower      mid50
## A           A  94 0.39004149 -0.13702229  0.02360743 0.3384615 0.43939394
## B           B  23 0.09543568 -0.16743006 -0.05596817 0.1076923 0.07575758
## C           C  19 0.07883817 -0.27622610 -0.13474801 0.1692308 0.04545455
## D       *   D 105 0.43568465 -0.04341855  0.16710875 0.3846154 0.43939394
##        mid75      upper
## A 0.42307692 0.36206897
## B 0.15384615 0.05172414
## C 0.05769231 0.03448276
## D 0.36538462 0.55172414
## 
## $i011
##   correct key   n       rspP       pBis     discrim     lower      mid50
## A           A 117 0.48547718 -0.3887571 -0.29151194 0.5846154 0.59090909
## B           B  23 0.09543568 -0.1398436 -0.02148541 0.1076923 0.12121212
## C       *   C  81 0.33609959  0.1696824  0.38435013 0.1846154 0.21212121
## D           D  20 0.08298755 -0.1548240 -0.07135279 0.1230769 0.07575758
##        mid75      upper
## A 0.44230769 0.29310345
## B 0.05769231 0.08620690
## C 0.42307692 0.56896552
## D 0.07692308 0.05172414
## 
## $i012
##   correct key   n       rspP       pBis     discrim     lower      mid50
## A       *   A 135 0.56016598  0.1279600  0.32413793 0.4000000 0.50000000
## B           B  59 0.24481328 -0.2253473 -0.03925729 0.2461538 0.31818182
## C           C  24 0.09958506 -0.2766477 -0.16737401 0.1846154 0.09090909
## D           D  23 0.09543568 -0.2673278 -0.11750663 0.1692308 0.09090909
##        mid75      upper
## A 0.65384615 0.72413793
## B 0.19230769 0.20689655
## C 0.09615385 0.01724138
## D 0.05769231 0.05172414
## 
## $i013
##   correct key  n      rspP       pBis    discrim     lower     mid50      mid75
## A           A 27 0.1120332 -0.2781623 -0.1501326 0.1846154 0.1363636 0.07692308
## B           B 46 0.1908714 -0.1753162 -0.1100796 0.2307692 0.1212121 0.30769231
## C           C 74 0.3070539 -0.3778479 -0.2564987 0.4461538 0.3333333 0.23076923
## D       *   D 94 0.3900415  0.2312726  0.5167109 0.1384615 0.4090909 0.38461538
##        upper
## A 0.03448276
## B 0.12068966
## C 0.18965517
## D 0.65517241
## 
## $i014
##   correct key   n      rspP       pBis    discrim     lower     mid50
## A       *   A 108 0.4481328  0.2134979  0.5358090 0.1538462 0.4393939
## B           B  31 0.1286307 -0.2535706 -0.1445623 0.2307692 0.1060606
## C           C  58 0.2406639 -0.3415163 -0.2774536 0.4153846 0.2121212
## D           D  44 0.1825726 -0.2361190 -0.1137931 0.2000000 0.2424242
##        mid75     upper
## A 0.55769231 0.6896552
## B 0.07692308 0.0862069
## C 0.17307692 0.1379310
## D 0.19230769 0.0862069
## 
## $i015
##   correct key  n       rspP       pBis     discrim      lower      mid50
## A           A 15 0.06224066 -0.1403108 -0.02148541 0.10769231 0.03030303
## B           B 74 0.30705394 -0.3315040 -0.24111406 0.43076923 0.31818182
## C       *   C 75 0.31120332  0.1947141  0.40769231 0.09230769 0.30303030
## D           D 77 0.31950207 -0.2661525 -0.14509284 0.36923077 0.34848485
##        mid75     upper
## A 0.01923077 0.0862069
## B 0.26923077 0.1896552
## C 0.38461538 0.5000000
## D 0.32692308 0.2241379
## 
## $i016
##   correct key   n       rspP       pBis     discrim      lower      mid50
## A       *   A 154 0.63900415  0.3353941  0.59071618 0.32307692 0.56060606
## B           B  39 0.16182573 -0.4136149 -0.31750663 0.36923077 0.13636364
## C           C  11 0.04564315 -0.1861371 -0.06153846 0.06153846 0.09090909
## D           D  37 0.15352697 -0.3545356 -0.21167109 0.24615385 0.21212121
##        mid75      upper
## A 0.82692308 0.91379310
## B 0.05769231 0.05172414
## C 0.01923077 0.00000000
## D 0.09615385 0.03448276
## 
## $i017
##   correct key   n      rspP       pBis    discrim     lower      mid50
## A           A  82 0.3402490 -0.4829386 -0.4350133 0.5384615 0.36363636
## B           B  26 0.1078838 -0.3113845 -0.1538462 0.1538462 0.21212121
## C       *   C 105 0.4356846  0.4609036  0.7716180 0.1076923 0.33333333
## D           D  28 0.1161826 -0.2706021 -0.1827586 0.2000000 0.09090909
##        mid75      upper
## A 0.32692308 0.10344828
## B 0.03846154 0.00000000
## C 0.48076923 0.87931034
## D 0.15384615 0.01724138
## 
## $i018
##   correct key  n       rspP       pBis    discrim      lower      mid50
## A           A 96 0.39834025 -0.4250997 -0.3180371 0.50769231 0.48484848
## B           B 43 0.17842324 -0.2755281 -0.1925729 0.26153846 0.18181818
## C           C 22 0.09128631 -0.2623175 -0.1366048 0.15384615 0.09090909
## D       *   D 80 0.33195021  0.4013261  0.6472149 0.07692308 0.24242424
##        mid75      upper
## A 0.38461538 0.18965517
## B 0.19230769 0.06896552
## C 0.09615385 0.01724138
## D 0.32692308 0.72413793
## 
## $i019
##   correct key   n       rspP        pBis     discrim     lower      mid50
## A           A  17 0.07053942 -0.14630133 -0.07135279 0.1230769 0.06060606
## B           B  48 0.19917012 -0.23586731 -0.09469496 0.2153846 0.22727273
## C           C 130 0.53941909 -0.22016521 -0.05570292 0.5384615 0.57575758
## D       *   D  46 0.19087137  0.09561872  0.22175066 0.1230769 0.13636364
##        mid75      upper
## A 0.03846154 0.05172414
## B 0.23076923 0.12068966
## C 0.55769231 0.48275862
## D 0.17307692 0.34482759
## 
## $i020
##   correct key   n       rspP       pBis     discrim      lower     mid50
## A           A  21 0.08713693 -0.2194626 -0.07506631 0.09230769 0.1515152
## B           B  49 0.20331950 -0.1948577 -0.07374005 0.24615385 0.1969697
## C       *   C 105 0.43568465  0.1054536  0.32838196 0.29230769 0.3939394
## D           D  66 0.27385892 -0.2973162 -0.17957560 0.36923077 0.2575758
##        mid75      upper
## A 0.07692308 0.01724138
## B 0.19230769 0.17241379
## C 0.46153846 0.62068966
## D 0.26923077 0.18965517

1.5 Validity of the Test

1.6 Comparison to Item Response Theory

CTT falls short in its sample dependency: item statistics and reliability estimates depend on the particular sample of examinees, which hampers applications that require linking scores across populations, such as test equating and computerized adaptive testing. Partial remedies within the CTT framework include equipercentile equating for test equating and Thurstone absolute scaling for item-invariant measurement.