CTT (aka weak theory or true-score theory) aims to predict outcomes of the test such as item difficulty or the ability of test-takers to improve reliability of the test itself.
CTT assumes that each person has a true score that would be obtained if there were no errors in measurement. Further, this approach to psychometric employs common estimate of measurement precision as a milestone to measure test takers’ level of construct regardless of their individual attribute.
The major advantage of CTT are its relatively weak theoretical assumptions, which make CTT easy to apply in many testing situations. Relatively weak theoretical assumptions not only characterize CTT but also its extensions (e.g., generalizability theory)
setwd("D:/Class Materials & Work/Summer 2020 practice/CTT")
library(psych)
library(CTT)
library(tidyverse)
First, we will convert a test answer into a dichotomous data set (0,1) by scoring with score()
#Importing test data
data <- read_csv("sample.score.csv", col_names = T)
#Importing test key
key <- read_csv("sample.key.csv", col_names = T)
key <- as.matrix(key)
#scoring
myScore <- score(data, key, output.scored=TRUE)
#Descriptive
describe(myScore$score)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 241 9.44 3 9 9.34 2.97 1 19 18 0.26 -0.14 0.19
#Sum score
sum_score <- as.data.frame(myScore$score)
We can also transform the score into a new scale by setting a new mean, SD, and normality with score.transform(). We will refer to the IQ score, with its mean = 100 and its SD = 15.
One can also transform the metric into a Z score (mean = 0, SD = 1), T score (mean = 50, SD = 10), or even Stanine score (mean = 5, SD = 2).
#Score transformation
IQ <- score.transform(myScore$score, mu.new = 100, sd.new = 15, normalize = T)
Item Characteristic Curves of Item 1
cttICC(score = myScore$score, itemVector = myScore$scored[,1], colTheme="spartans", cex=1.5)
We will convert our score data into matrix for further implementation.
#extract responses only
responses <- as.matrix(myScore$scored)
dimnames(responses) <- NULL
#Assign
(N <- dim(responses)[2]) # number of items
## [1] 20
(K = dim(responses)[1]) # number of test-takers
## [1] 241
Item analysis within the classical approach often relies on two statistics for evaluating single items: the P-value and the point-biserial correlation coefficient.
The P-value represents the proportion of examinees responding in the keyed direction, and is typically referred to as item difficulty.
The point-biserial correlation coefficient is a particular item’s correlation with all of the other items and it provides an index of differentiating power of the item, which is typically referred to as item discrimination.
For starter, we will compute internal consistency of the test, which encompasses Cronbach’s Alpha, KR-20, and KR-21.
For Cronbach’s Alpha:
#Cronbach Alpha and Guttman's Lambda 6
#With psych package
psych::alpha(myScore$scored, check.keys = T)
## Warning in psych::alpha(myScore$scored, check.keys = T): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
##
## Reliability analysis
## Call: psych::alpha(x = myScore$scored, check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.55 0.55 0.58 0.057 1.2 0.041 0.48 0.15 0.045
##
## lower alpha upper 95% confidence boundaries
## 0.47 0.55 0.63
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## V1 0.55 0.54 0.58 0.059 1.18 0.042 0.0081 0.045
## V2 0.56 0.55 0.58 0.060 1.21 0.041 0.0079 0.048
## V3 0.56 0.55 0.59 0.060 1.22 0.041 0.0080 0.048
## V4 0.56 0.56 0.59 0.062 1.25 0.040 0.0080 0.052
## V5 0.55 0.54 0.58 0.059 1.20 0.041 0.0081 0.048
## V6 0.51 0.50 0.54 0.050 1.00 0.046 0.0071 0.041
## V7 0.56 0.55 0.59 0.062 1.25 0.040 0.0078 0.051
## V8 0.55 0.54 0.58 0.059 1.19 0.041 0.0080 0.048
## V9 0.55 0.54 0.58 0.058 1.17 0.042 0.0080 0.048
## V10- 0.56 0.56 0.59 0.062 1.25 0.040 0.0079 0.052
## V11 0.54 0.54 0.57 0.057 1.15 0.042 0.0081 0.045
## V12 0.55 0.54 0.58 0.058 1.17 0.042 0.0080 0.044
## V13 0.53 0.52 0.56 0.055 1.10 0.043 0.0074 0.044
## V14 0.53 0.52 0.56 0.055 1.10 0.043 0.0076 0.043
## V15 0.54 0.53 0.57 0.057 1.14 0.042 0.0076 0.044
## V16 0.51 0.51 0.55 0.051 1.03 0.045 0.0068 0.044
## V17 0.49 0.48 0.52 0.047 0.93 0.048 0.0056 0.039
## V18 0.51 0.50 0.53 0.050 1.00 0.046 0.0060 0.044
## V19 0.55 0.54 0.58 0.059 1.19 0.042 0.0079 0.045
## V20 0.55 0.55 0.58 0.060 1.20 0.041 0.0077 0.048
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## V1 241 0.26 0.28 0.181 0.125 0.76 0.43
## V2 241 0.23 0.24 0.127 0.083 0.73 0.44
## V3 241 0.22 0.23 0.115 0.082 0.26 0.44
## V4 241 0.20 0.20 0.066 0.043 0.68 0.47
## V5 241 0.26 0.26 0.152 0.106 0.37 0.48
## V6 241 0.51 0.50 0.497 0.370 0.54 0.50
## V7 241 0.19 0.20 0.074 0.045 0.28 0.45
## V8 241 0.27 0.27 0.163 0.113 0.48 0.50
## V9 241 0.26 0.29 0.199 0.143 0.83 0.38
## V10- 241 0.21 0.20 0.067 0.043 0.56 0.50
## V11 241 0.32 0.32 0.228 0.169 0.34 0.47
## V12 241 0.31 0.30 0.205 0.148 0.56 0.50
## V13 241 0.39 0.38 0.323 0.240 0.39 0.49
## V14 241 0.40 0.38 0.327 0.247 0.45 0.50
## V15 241 0.33 0.33 0.254 0.183 0.31 0.46
## V16 241 0.47 0.47 0.452 0.337 0.64 0.48
## V17 241 0.60 0.59 0.647 0.478 0.44 0.50
## V18 241 0.52 0.50 0.526 0.387 0.33 0.47
## V19 241 0.25 0.27 0.176 0.123 0.19 0.39
## V20 241 0.26 0.25 0.152 0.097 0.44 0.50
##
## Non missing response frequency for each item
## 0 1 miss
## [1,] 0.24 0.76 0
## [2,] 0.27 0.73 0
## [3,] 0.74 0.26 0
## [4,] 0.32 0.68 0
## [5,] 0.63 0.37 0
## [6,] 0.46 0.54 0
## [7,] 0.72 0.28 0
## [8,] 0.52 0.48 0
## [9,] 0.17 0.83 0
## [10,] 0.56 0.44 0
## [11,] 0.66 0.34 0
## [12,] 0.44 0.56 0
## [13,] 0.61 0.39 0
## [14,] 0.55 0.45 0
## [15,] 0.69 0.31 0
## [16,] 0.36 0.64 0
## [17,] 0.56 0.44 0
## [18,] 0.67 0.33 0
## [19,] 0.81 0.19 0
## [20,] 0.56 0.44 0
#With a written function
cronbachs.alpha <-
function(X)
{
X <- data.matrix(X)
n <- ncol(X) # Number of items
k <- nrow(X) # Number of examinees
# Cronbachs alpha
alpha <- (n/(n - 1))*(1 - sum(apply(X, 2, var))/var(rowSums(X)))
return(list("Crombach's alpha" = alpha,
"Number of items" = n,
"Number of examinees" = k))
}
#Dump "cronbach.alpha" function for further use
dump("cronbachs.alpha", file = "cronbachs.alpha.R")
#Compute Cronbach's alpha
cronbachs.alpha(responses)
## $`Crombach's alpha`
## [1] 0.5395239
##
## $`Number of items`
## [1] 20
##
## $`Number of examinees`
## [1] 241
#Compare
CTT::reliability(responses)
## You will find additional options and better formatting using itemAnalysis().
##
## Number of Items
## 20
##
## Number of Examinees
## 241
##
## Coefficient Alpha
## 0.54
For Kuder-Richardson formula 20 (KR20) and Kuder-Richardson formula 21 (KR21)
# formula 20
KR20 <-
function(X)
{
X <- data.matrix(X)
k <- ncol(X)
# Person total score variances
SX <- var(rowSums(X))
# item means
IM <- colMeans(X)
return(((k/(k - 1))*((SX - sum(IM*(1 - IM)))/SX)))
}
KR20(responses)
## [1] 0.541653
# formula 21
KR21 <-
function(X)
{
X <- data.matrix(X)
n <- ncol(X)
return((n/(n-1))*((var(rowSums(X)) - n*(sum(colMeans(X))/n) *
(1-(sum(colMeans(X))/n))))/var(rowSums(X)))
}
KR21(responses)
## [1] 0.4700428
The test-retest reliability is an estimation of reliability based on the correlation of two equivalent forms of tests. Correlation between the split-halves is a reasonable measure of the reliability of one half of the test. Reliability of the entire test would be greater than the reliability of either half taken alone.
#Split data (cases) into two equally and randomly.
SPLIT.CASES <-
function(X, seed = NULL)
{
# optional fixed seed
if (!is.null(seed)) {set.seed(seed)}
X <- as.matrix(X)
# if k = 2x, then lengths Y1 = Y2
# if k = 2x+1, then lenths Y1 = Y2+1
k <- nrow(X)
index <- sample(1:k, ceiling(k/2))
Y1 <- X[index, ]
Y2 <- X[-index, ]
return(list(Y1, Y2))
}
#Split data (variables-item) into two equally and randomly.
SPLIT.ITEMS <-
function(X, seed = NULL)
{
# optional fixed seed
if (!is.null(seed)) {set.seed(seed)}
X <- as.matrix(X)
# if n = 2x, then lengths Y1 = Y2
# if n = 2x+1, then lenths Y1 = Y2+1
n <- ncol(X)
index <- sample(1:n, ceiling(n/2))
Y1 <- X[, index ]
Y2 <- X[, -index]
return(list(Y1, Y2))
}
dump("SPLIT.ITEMS", file = "SPLIT.ITEMS.R")
#with psych package
splitHalf(responses, raw = T)
## Warning in splitHalf(responses, raw = T): Some items were negatively correlated
## with total scale and were automatically reversed.
## Split half reliabilities
## Call: splitHalf(r = responses, raw = T)
##
## Maximum split half reliability (lambda 4) = 0.69
## Guttman lambda 6 = 0.58
## Average split half reliability = 0.55
## Guttman lambda 3 (alpha) = 0.55
## Guttman lambda 2 = 0.57
## Minimum split half reliability (beta) = 0.34
## Average interitem r = 0.06 with median = 0.04
## 2.5% 50% 97.5%
## Quantiles of split half reliability = 0.44 0.55 0.63
For spearman-brown formula, which predicts the reliability of a test after changing the test length.
# Spearman-Brown formula
SpearmanBrown <-
function(x, n1, n2)
{
source("cronbachs.alpha.R")
x <- as.matrix(x)
N <- n2/n1
# cronbach's alpha for the original test
alpha <- cronbachs.alpha(x)[[1]]
predicted.alpha <- N * alpha / (1 + (N - 1) * alpha)
return(list(original.reliability = alpha,
original.sample.size = n1,
predicted.reliability = predicted.alpha,
predicted.sample.size = n2))
}
# predict reliability by Spearman-Brown formula
# if the number of items is reduced from 25 to 15
SpearmanBrown(responses, n1 = 20, n2 = 15)
## $original.reliability
## [1] 0.5395239
##
## $original.sample.size
## [1] 20
##
## $predicted.reliability
## [1] 0.4677309
##
## $predicted.sample.size
## [1] 15
# predict reliability by Spearman-Brown formula
# if the number of items is increased from 25 to 35
SpearmanBrown(responses, n1 = 20, n2 = 35)
## $original.reliability
## [1] 0.5395239
##
## $original.sample.size
## [1] 20
##
## $predicted.reliability
## [1] 0.6721757
##
## $predicted.sample.size
## [1] 35
# predict reliability by Spearman-Brown formula
# if the number of items is doubled
SpearmanBrown(responses, n1 = 1, n2 = 2)
## $original.reliability
## [1] 0.5395239
##
## $original.sample.size
## [1] 1
##
## $predicted.reliability
## [1] 0.7008971
##
## $predicted.sample.size
## [1] 2
Guttman’s Lambda_3 (Guttman 1945)
This index works in the same way as Cronbach’s Alpha.
G3 <-
function(responses)
{
X <- as.matrix(responses)
k <- ncol(X) # number of items
SX <- var(rowSums(X)) # variance of total person scores
SI <- apply(X, 2, var) # item variances
L1 <- (1 - (sum(SI)/SX)) # Guttman's lamda 1
return((k*L1)/(k - 1))
}
# compute G3
G3(responses)
## [1] 0.5395239
#Compare
psych::guttman(responses)
## Call: psych::guttman(r = responses)
##
## Alternative estimates of reliability
##
## Guttman bounds
## L1 = 0.51
## L2 = 0.56
## L3 (alpha) = 0.53
## L4 (max) = 0.69
## L5 = 0.55
## L6 (smc) = 0.57
## TenBerge bounds
## mu0 = 0.53 mu1 = 0.56 mu2 = 0.56 mu3 = 0.56
##
## alpha of first PC = 0.64
## estimated greatest lower bound based upon communalities= 0.69
##
## beta found by splitHalf = 0.34
Pearson product-moment correlation coefficient
We will be computing a bootstrapped (1,000 replicates) Pearson product-moment correlation coefficient between two random subsets of examinees.
pearson <-
function(X, seed = NULL, n = NULL)
{
source("SPLIT.ITEMS.R")
# optional fixed seed
if (!is.null(seed)) {set.seed(seed)}
# the number of bootstrap replicates
if (is.null(n)) {n <- 1e3}
X <- as.matrix(X)
r <- rep(NA, n)
for (i in 1:n) {
# split items
Y <- SPLIT.ITEMS(X)
# total scores
S1 <- as.matrix(rowSums(Y[[1]]))
S2 <- as.matrix(rowSums(Y[[2]]))
# residual scores
R1 <- S1 - mean(S1)
R2 <- S2 - mean(S2)
# Pearson product-moment correlation coefficient
r[i] <- (t(R1) %*% R2) / (sqrt((t(R1) %*% R1)) * sqrt((t(R2) %*% R2)))
}
return(mean(r))
}
# compute the Pearson product-moment correlation coefficient
pearson(responses, seed = 456, n = 1)
## [1] 0.3499066
# compare
# split items
set.seed(456)
Y <- SPLIT.ITEMS(responses)
# total scores
S1 <- as.matrix(rowSums(Y[[1]]))
S2 <- as.matrix(rowSums(Y[[2]]))
cor(S1, S2)
## [,1]
## [1,] 0.3499066
All in all, for reliability, When scores are not tau-equivalent (for example when there is not homogeneous but rather examination items of increasing difficulty) then the KR-20 is an indication of the lower bound of internal consistency (reliability).
KR21 typically underestimates the reliability of a test compared to KR20, when the item difficulties are not equal. The formulas will be equal only if the item difficulties are equal.
Cronbach’s \(\alpha\) and Kuder-Richardson (Kuder and Richardson 1937) methods produce a lower bound for a test’s reliability, which equals the test reliability if the split-halves are essentially \(\tau\)-equivalent. These coefficients should only be used for homegeneous tests, since they reflect item homegeneity, else they will be inappropriately low.
A high KR-20 coefficient (e.g., > 0.90) provides an evidence for the assumption of a homogeneous or unidimensional test. However, it is possible, to have a high KR-20 with a multidimensional scale, especially with a large number of test items.
The Spearman-Brown coefficient can over or underestimate the test reliability if the split-halves are not parallel tests. When the tests are parallel the Spearman-Brown formula is useful to test the effects of test length on reliability.
A test can produce different reliability estimates when administered to different samples of examinees. Generazibility Theory examines such systematic effects on reliability.
The standard error of measurement (SEm) is a measure of how much measured test scores are spread around a “true” score. Standard error of measurement using Cronbach’s alpha as the reliability statistic cab be computed as:
SEM <-
function(X){
source("cronbachs.alpha.R")
X <- data.matrix(X)
return(sd(rowSums(X)) * sqrt(1 - cronbachs.alpha(X)[[1]]))
}
SEM(responses)
## [1] 2.036401
As a result, a person’s true score in 68% CI would be ± 2.036401.
Note that the critical-value for a 90% confidence interval is 1.65.
# 90% confidence interval for the true score
head(cbind(lower_bound = round(rowSums(responses)-1.65* sd(rowSums(responses))*
sqrt(1-KR20(responses)), 2), observed = rowSums(responses),
upper_bound = round(rowSums(responses)+1.65* sd(rowSums(responses))*
sqrt(1-KR20(responses)), 2)), 20)
## lower_bound observed upper_bound
## [1,] 5.65 9 12.35
## [2,] 2.65 6 9.35
## [3,] 9.65 13 16.35
## [4,] 5.65 9 12.35
## [5,] 10.65 14 17.35
## [6,] 8.65 12 15.35
## [7,] 3.65 7 10.35
## [8,] 5.65 9 12.35
## [9,] 4.65 8 11.35
## [10,] 7.65 11 14.35
## [11,] 2.65 6 9.35
## [12,] 4.65 8 11.35
## [13,] 1.65 5 8.35
## [14,] 9.65 13 16.35
## [15,] 4.65 8 11.35
## [16,] 1.65 5 8.35
## [17,] 11.65 15 18.35
## [18,] 4.65 8 11.35
## [19,] 6.65 10 13.35
## [20,] 9.65 13 16.35
Item analysis within the classical approach often relies on two statistics for evaluating single items: the P-value and the point-biserial correlation coefficient.
The P-value represents the proportion of examinees responding in the keyed direction, and is typically referred to as item difficulty.
The point-biserial correlation coefficient is an item’s correlation with all of the other items. It provides an index of differentiating power of the item, which is typically referred to as item discrimination.
item.analysis <-
function(responses)
{
# CRITICAL VALUES
cvpb = 0.20
cvdl = 0.15
cvdu = 0.85
require(CTT, warn.conflicts = FALSE, quietly = TRUE)
(ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE))
# Mark items that are potentially problematic
item.analysis <- data.frame(item = seq(1:ctt.analysis$nItem),
r.pbis = ctt.analysis$pBis,
bis = ctt.analysis$bis,
item.mean = ctt.analysis$itemMean,
alpha.del = ctt.analysis$alphaIfDeleted)
# code provided by Dr. Gordon Brooks
if (TRUE) {
item.analysis$check <-
ifelse(item.analysis$r.pbis < cvpb |
item.analysis$item.mean < cvdl |
item.analysis$item.mean > cvdu, "/", "")
}
return(item.analysis)
}
knitr::kable(item.analysis(responses),
align = "c",
caption = "Item Analysis")
## You will find additional options and better formatting using itemAnalysis().
| item | r.pbis | bis | item.mean | alpha.del | check |
|---|---|---|---|---|---|
| 1 | 0.1147596 | 0.1683232 | 0.7593361 | 0.5353385 | / |
| 2 | 0.1035530 | 0.1409292 | 0.7344398 | 0.5372914 | / |
| 3 | 0.0506175 | 0.0685887 | 0.2572614 | 0.5452346 | / |
| 4 | 0.0591143 | 0.0785356 | 0.6804979 | 0.5450389 | / |
| 5 | 0.1111202 | 0.1406669 | 0.3651452 | 0.5369402 | / |
| 6 | 0.3656502 | 0.4640379 | 0.5394191 | 0.4909063 | |
| 7 | 0.0700444 | 0.0920897 | 0.2780083 | 0.5426513 | / |
| 8 | 0.1112357 | 0.1394688 | 0.4813278 | 0.5373705 | / |
| 9 | 0.1515853 | 0.2230714 | 0.8298755 | 0.5299043 | / |
| 10 | -0.0434185 | -0.0549382 | 0.4356846 | 0.5634988 | / |
| 11 | 0.1696824 | 0.2157830 | 0.3360996 | 0.5269375 | / |
| 12 | 0.1279600 | 0.1627060 | 0.5601660 | 0.5343645 | / |
| 13 | 0.2312726 | 0.2933878 | 0.3900415 | 0.5161192 | |
| 14 | 0.2134979 | 0.2656683 | 0.4481328 | 0.5191567 | |
| 15 | 0.1947141 | 0.2547864 | 0.3112033 | 0.5227779 | / |
| 16 | 0.3353941 | 0.4428850 | 0.6390041 | 0.4977310 | |
| 17 | 0.4609036 | 0.5799529 | 0.4356846 | 0.4727918 | |
| 18 | 0.4013261 | 0.5080745 | 0.3319502 | 0.4865119 | |
| 19 | 0.0956187 | 0.1323704 | 0.1908714 | 0.5375545 | / |
| 20 | 0.1054536 | 0.1323520 | 0.4356846 | 0.5382775 | / |
#Compare with CTT package
item <- as.data.frame(myScore$scored)
Analyze <- itemAnalysis(item, itemReport=TRUE, NA.Delete=TRUE, pBisFlag = T, bisFlag = T, flagStyle = c("X",""))
str(Analyze)
## List of 6
## $ nItem : int 20
## $ nPerson : int 241
## $ alpha : num 0.54
## $ scaleMean : num 9.44
## $ scaleSD : num 3
## $ itemReport:'data.frame': 20 obs. of 7 variables:
## ..$ itemName : chr [1:20] "V1" "V2" "V3" "V4" ...
## ..$ itemMean : num [1:20] 0.759 0.734 0.257 0.68 0.365 ...
## ..$ pBis : num [1:20] 0.1148 0.1036 0.0506 0.0591 0.1111 ...
## ..$ bis : num [1:20] 0.1576 0.1395 0.0686 0.0771 0.1423 ...
## ..$ alphaIfDeleted: num [1:20] 0.535 0.537 0.545 0.545 0.537 ...
## ..$ lowPBis : chr [1:20] "X" "X" "X" "X" ...
## ..$ lowBis : chr [1:20] "X" "X" "X" "X" ...
## - attr(*, "class")= chr "itemAnalysis"
knitr::kable(Analyze$itemReport,
align = "c",
caption = "Item Analysis")
| itemName | itemMean | pBis | bis | alphaIfDeleted | lowPBis | lowBis |
|---|---|---|---|---|---|---|
| V1 | 0.7593361 | 0.1147596 | 0.1575704 | 0.5353385 | X | X |
| V2 | 0.7344398 | 0.1035530 | 0.1394722 | 0.5372914 | X | X |
| V3 | 0.2572614 | 0.0506175 | 0.0685887 | 0.5452346 | X | X |
| V4 | 0.6804979 | 0.0591143 | 0.0771286 | 0.5450389 | X | X |
| V5 | 0.3651452 | 0.1111202 | 0.1423179 | 0.5369402 | X | X |
| V6 | 0.5394191 | 0.3656502 | 0.4590911 | 0.4909063 | X | X |
| V7 | 0.2780083 | 0.0700444 | 0.0935474 | 0.5426513 | X | X |
| V8 | 0.4813278 | 0.1112357 | 0.1394688 | 0.5373705 | X | X |
| V9 | 0.8298755 | 0.1515853 | 0.2249732 | 0.5299043 | X | X |
| V10 | 0.4356846 | -0.0434185 | -0.0546771 | 0.5634988 | X | X |
| V11 | 0.3360996 | 0.1696824 | 0.2197305 | 0.5269375 | X | X |
| V12 | 0.5601660 | 0.1279600 | 0.1610437 | 0.5343645 | X | X |
| V13 | 0.3900415 | 0.2312726 | 0.2940006 | 0.5161192 | X | X |
| V14 | 0.4481328 | 0.2134979 | 0.2684080 | 0.5191567 | X | X |
| V15 | 0.3112033 | 0.1947141 | 0.2551016 | 0.5227779 | X | X |
| V16 | 0.6390041 | 0.3353941 | 0.4301674 | 0.4977310 | X | X |
| V17 | 0.4356846 | 0.4609036 | 0.5804172 | 0.4727918 | X | X |
| V18 | 0.3319502 | 0.4013261 | 0.5206309 | 0.4865119 | X | X |
| V19 | 0.1908714 | 0.0956187 | 0.1380853 | 0.5375545 | X | X |
| V20 | 0.4356846 | 0.1054536 | 0.1327980 | 0.5382775 | X | X |
Item difficulty of an item is the proportion of students who answer a particular test item correctly. This index is useful in assessing whether it is appropriate for the level of the students taking the test. The desired range of item difficulty index is between 0.3 to 0.7, while the number close to 0 or 1 offers little information on measuring students’ level of the construct.
However, the extreme cut-off for item difficulty could apply to measurements that are designed for extreme groups.
library(psychometric)
Item_Difficulty <- item.exam(x = responses, y = NULL, discrim = T)
rmarkdown::paged_table(Item_Difficulty)
The item discrimination index is a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not, or between masters and non-masters.
The possible range of the discrimination index is -1.0 to 1.0; however, if an item has a discrimination below 0.0, it suggests a problem. When an item is discriminating negatively, overall the most knowledgeable examinees are getting the item wrong and the least knowledgeable examinees are getting the item right. A negative discrimination index may indicate that the item is measuring something other than what the rest of the test is measuring. More often, it is a sign that the item has been mis-keyed.
item.discrimination <-
function(responses)
{
# CRITICAL VALUES
cvpb = 0.20
cvdl = 0.15
cvdu = 0.85
require(CTT, warn.conflicts = FALSE, quietly = TRUE)
ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE)
item.discrimination <- data.frame(item = 1:ctt.analysis$nItem ,
discrimination = ctt.analysis$pBis)
plot(item.discrimination,
type = "p",
pch = 1,
cex = 3,
col = "purple",
ylab = "Item-Total Correlation",
xlab = "Item Number",
ylim = c(0, 1),
main = "Test Item Discriminations")
abline(h = cvpb, col = "red")
outlier <- data.matrix(subset(item.discrimination,
subset = (item.discrimination[, 2] < cvpb)))
text(outlier, paste("i", outlier[,1], sep = ""), col = "red", cex = .7)
return(item.discrimination[order(item.discrimination$discrimination),])
}
item.discrimination(responses)
## You will find additional options and better formatting using itemAnalysis().
## item discrimination
## 10 10 -0.04341855
## 3 3 0.05061749
## 4 4 0.05911432
## 7 7 0.07004435
## 19 19 0.09561872
## 2 2 0.10355300
## 20 20 0.10545357
## 5 5 0.11112015
## 8 8 0.11123568
## 1 1 0.11475958
## 12 12 0.12795997
## 9 9 0.15158531
## 11 11 0.16968243
## 15 15 0.19471413
## 14 14 0.21349786
## 13 13 0.23127262
## 16 16 0.33539410
## 6 6 0.36565023
## 18 18 0.40132606
## 17 17 0.46090362
The item total correlation is a correlation between the question score (e.g., 0 or 1 for multiple choice) and the overall assessment score.
It is expected that if a participant gets a question correct they should, in general, have higher overall assessment scores than participants who get a question wrong.
The relationship between question score of an item to the total score of that assessment could be attributed as item discrimination. The value of item total correlation, regarded alternatively as point-biserial, below 0.19 indicate poor item discrimination, while 0.2 and 0.39 indicate good discrimination, and 0.4 and above indicate excellent discrimination.
test_item.total <-
function(responses)
{
# CRITICAL VALUES
cvpb = 0.20
cvdl = 0.15
cvdu = 0.85
require(CTT, warn.conflicts = FALSE, quietly = TRUE)
ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE)
test_item.total <- data.frame(item = 1:ctt.analysis$nItem ,
biserial = ctt.analysis$bis)
plot(test_item.total,
main = "Test Item-Total Correlation",
type = "p",
pch = 1,
cex = 2.8,
col = "purple",
ylab = "Item-Total Correlation",
xlab = "Item Number",
ylim = c(0, 1),
xlim = c(0, ctt.analysis$nItem))
abline(h = cvpb, col = "red")
outlier <- data.matrix(subset(test_item.total,
subset = test_item.total[,2] < cvpb))
text(outlier, paste("i", outlier[,1], sep = ""), col = "red", cex = .7)
return(test_item.total[order(test_item.total$biserial),])
}
test_item.total(responses)
## You will find additional options and better formatting using itemAnalysis().
## item biserial
## 10 10 -0.05493825
## 3 3 0.06858871
## 4 4 0.07853557
## 7 7 0.09208970
## 20 20 0.13235198
## 19 19 0.13237037
## 8 8 0.13946879
## 5 5 0.14066694
## 2 2 0.14092920
## 12 12 0.16270597
## 1 1 0.16832321
## 11 11 0.21578295
## 9 9 0.22307140
## 15 15 0.25478638
## 14 14 0.26566826
## 13 13 0.29338781
## 16 16 0.44288497
## 6 6 0.46403793
## 18 18 0.50807452
## 17 17 0.57995293
In distractor analysis examinees are divided into three ability levels lower, middle and upper based on their total test score. The proportions of examinees who mark each option in each of the three ability levels are compared.
In the lower ability level, we would expect to see a smaller proportion of examinees choosing the correct option and a larger proportion of them marking the incorrect options or distractors.
Ideally, good distractors would attract about the same proportion of examinees. Distractors that don’t attract any or attract very small proportion of examinees relative to other distractors should be considered for revision.
In the higher ability level, we would expect to see that the majority of examinees choose the correct option. If a distractor is more appealing than the correct option to the higher ability level examinees, then it should be eliminated or revised.
distractorAnalysis(items = data, key = key, nGroups = 4, pTable = T)
## $i001
## correct key n rspP pBis discrim lower mid50
## A * A 183 0.75933610 0.1147596 0.22334218 0.70769231 0.63636364
## B B 16 0.06639004 -0.2888067 -0.13846154 0.13846154 0.06060606
## C C 22 0.09128631 -0.1975788 -0.05968170 0.07692308 0.16666667
## D D 20 0.08298755 -0.1841578 -0.02519894 0.07692308 0.13636364
## mid75 upper
## A 0.78846154 0.93103448
## B 0.05769231 0.00000000
## C 0.09615385 0.01724138
## D 0.05769231 0.05172414
##
## $i002
## correct key n rspP pBis discrim lower mid50
## A A 8 0.03319502 -0.01743924 0.01909814 0.01538462 0.01515152
## B B 40 0.16597510 -0.28940212 -0.16180371 0.23076923 0.16666667
## C * C 177 0.73443983 0.10355300 0.26392573 0.61538462 0.74242424
## D D 16 0.06639004 -0.28880672 -0.12122016 0.13846154 0.07575758
## mid75 upper
## A 0.07692308 0.03448276
## B 0.19230769 0.06896552
## C 0.71153846 0.87931034
## D 0.01923077 0.01724138
##
## $i003
## correct key n rspP pBis discrim lower mid50
## A A 61 0.2531120 -0.21045253 -0.05278515 0.2769231 0.2878788
## B B 74 0.3070539 -0.32875754 -0.22387268 0.4307692 0.2878788
## C * C 62 0.2572614 0.05061749 0.16021220 0.1846154 0.1818182
## D D 44 0.1825726 -0.04787523 0.11644562 0.1076923 0.2424242
## mid75 upper
## A 0.2115385 0.2241379
## B 0.2884615 0.2068966
## C 0.3461538 0.3448276
## D 0.1538462 0.2241379
##
## $i004
## correct key n rspP pBis discrim lower mid50
## A A 11 0.04564315 -0.16015191 -0.09045093 0.10769231 0.0000000
## B B 24 0.09958506 -0.15640748 -0.04244032 0.07692308 0.1515152
## C C 42 0.17427386 -0.28460760 -0.15437666 0.29230769 0.1363636
## D * D 164 0.68049793 0.05911432 0.28726790 0.52307692 0.7121212
## mid75 upper
## A 0.05769231 0.01724138
## B 0.13461538 0.03448276
## C 0.11538462 0.13793103
## D 0.69230769 0.81034483
##
## $i005
## correct key n rspP pBis discrim lower mid50
## A A 84 0.34854772 -0.2097597 -0.02997347 0.32307692 0.42424242
## B * B 88 0.36514523 0.1111202 0.27294430 0.26153846 0.22727273
## C C 61 0.25311203 -0.3237027 -0.19681698 0.36923077 0.30303030
## D D 8 0.03319502 -0.1706663 -0.04615385 0.04615385 0.04545455
## mid75 upper
## A 0.34615385 0.2931034
## B 0.48076923 0.5344828
## C 0.13461538 0.1724138
## D 0.03846154 0.0000000
##
## $i006
## correct key n rspP pBis discrim lower mid50
## A A 61 0.25311203 -0.3587149 -0.2503979 0.3538462 0.34848485
## B * B 130 0.53941909 0.3656502 0.6466844 0.2153846 0.45454545
## C C 34 0.14107884 -0.3800902 -0.2615385 0.2615385 0.16666667
## D D 16 0.06639004 -0.2994444 -0.1347480 0.1692308 0.03030303
## mid75 upper
## A 0.17307692 0.10344828
## B 0.69230769 0.86206897
## C 0.11538462 0.00000000
## D 0.01923077 0.03448276
##
## $i007
## correct key n rspP pBis discrim lower mid50
## A * A 67 0.2780083 0.07004435 0.211936340 0.1846154 0.2424242
## B B 63 0.2614108 -0.17772394 0.027851459 0.2307692 0.3333333
## C C 47 0.1950207 -0.13004489 0.003183024 0.1692308 0.1818182
## D D 64 0.2655602 -0.32387947 -0.242970822 0.4153846 0.2424242
## mid75 upper
## A 0.3076923 0.3965517
## B 0.2115385 0.2586207
## C 0.2692308 0.1724138
## D 0.2115385 0.1724138
##
## $i008
## correct key n rspP pBis discrim lower mid50
## A A 61 0.25311203 -0.45040855 -0.34641910 0.41538462 0.34848485
## B B 53 0.21991701 -0.11191555 0.01061008 0.23076923 0.16666667
## C * C 116 0.48132780 0.11123568 0.36286472 0.29230769 0.42424242
## D D 11 0.04564315 -0.08820493 -0.02705570 0.06153846 0.06060606
## mid75 upper
## A 0.13461538 0.06896552
## B 0.25000000 0.24137931
## C 0.59615385 0.65517241
## D 0.01923077 0.03448276
##
## $i009
## correct key n rspP pBis discrim lower mid50
## A A 8 0.03319502 -0.1554692 -0.04244032 0.07692308 0.01515152
## B * B 200 0.82987552 0.1515853 0.28488064 0.64615385 0.87878788
## C C 19 0.07883817 -0.2169620 -0.15013263 0.18461538 0.01515152
## D D 14 0.05809129 -0.2866682 -0.09230769 0.09230769 0.09090909
## mid75 upper
## A 0.00000000 0.03448276
## B 0.88461538 0.93103448
## C 0.07692308 0.03448276
## D 0.03846154 0.00000000
##
## $i010
## correct key n rspP pBis discrim lower mid50
## A A 94 0.39004149 -0.13702229 0.02360743 0.3384615 0.43939394
## B B 23 0.09543568 -0.16743006 -0.05596817 0.1076923 0.07575758
## C C 19 0.07883817 -0.27622610 -0.13474801 0.1692308 0.04545455
## D * D 105 0.43568465 -0.04341855 0.16710875 0.3846154 0.43939394
## mid75 upper
## A 0.42307692 0.36206897
## B 0.15384615 0.05172414
## C 0.05769231 0.03448276
## D 0.36538462 0.55172414
##
## $i011
## correct key n rspP pBis discrim lower mid50
## A A 117 0.48547718 -0.3887571 -0.29151194 0.5846154 0.59090909
## B B 23 0.09543568 -0.1398436 -0.02148541 0.1076923 0.12121212
## C * C 81 0.33609959 0.1696824 0.38435013 0.1846154 0.21212121
## D D 20 0.08298755 -0.1548240 -0.07135279 0.1230769 0.07575758
## mid75 upper
## A 0.44230769 0.29310345
## B 0.05769231 0.08620690
## C 0.42307692 0.56896552
## D 0.07692308 0.05172414
##
## $i012
## correct key n rspP pBis discrim lower mid50
## A * A 135 0.56016598 0.1279600 0.32413793 0.4000000 0.50000000
## B B 59 0.24481328 -0.2253473 -0.03925729 0.2461538 0.31818182
## C C 24 0.09958506 -0.2766477 -0.16737401 0.1846154 0.09090909
## D D 23 0.09543568 -0.2673278 -0.11750663 0.1692308 0.09090909
## mid75 upper
## A 0.65384615 0.72413793
## B 0.19230769 0.20689655
## C 0.09615385 0.01724138
## D 0.05769231 0.05172414
##
## $i013
## correct key n rspP pBis discrim lower mid50 mid75
## A A 27 0.1120332 -0.2781623 -0.1501326 0.1846154 0.1363636 0.07692308
## B B 46 0.1908714 -0.1753162 -0.1100796 0.2307692 0.1212121 0.30769231
## C C 74 0.3070539 -0.3778479 -0.2564987 0.4461538 0.3333333 0.23076923
## D * D 94 0.3900415 0.2312726 0.5167109 0.1384615 0.4090909 0.38461538
## upper
## A 0.03448276
## B 0.12068966
## C 0.18965517
## D 0.65517241
##
## $i014
## correct key n rspP pBis discrim lower mid50
## A * A 108 0.4481328 0.2134979 0.5358090 0.1538462 0.4393939
## B B 31 0.1286307 -0.2535706 -0.1445623 0.2307692 0.1060606
## C C 58 0.2406639 -0.3415163 -0.2774536 0.4153846 0.2121212
## D D 44 0.1825726 -0.2361190 -0.1137931 0.2000000 0.2424242
## mid75 upper
## A 0.55769231 0.6896552
## B 0.07692308 0.0862069
## C 0.17307692 0.1379310
## D 0.19230769 0.0862069
##
## $i015
## correct key n rspP pBis discrim lower mid50
## A A 15 0.06224066 -0.1403108 -0.02148541 0.10769231 0.03030303
## B B 74 0.30705394 -0.3315040 -0.24111406 0.43076923 0.31818182
## C * C 75 0.31120332 0.1947141 0.40769231 0.09230769 0.30303030
## D D 77 0.31950207 -0.2661525 -0.14509284 0.36923077 0.34848485
## mid75 upper
## A 0.01923077 0.0862069
## B 0.26923077 0.1896552
## C 0.38461538 0.5000000
## D 0.32692308 0.2241379
##
## $i016
## correct key n rspP pBis discrim lower mid50
## A * A 154 0.63900415 0.3353941 0.59071618 0.32307692 0.56060606
## B B 39 0.16182573 -0.4136149 -0.31750663 0.36923077 0.13636364
## C C 11 0.04564315 -0.1861371 -0.06153846 0.06153846 0.09090909
## D D 37 0.15352697 -0.3545356 -0.21167109 0.24615385 0.21212121
## mid75 upper
## A 0.82692308 0.91379310
## B 0.05769231 0.05172414
## C 0.01923077 0.00000000
## D 0.09615385 0.03448276
##
## $i017
## correct key n rspP pBis discrim lower mid50
## A A 82 0.3402490 -0.4829386 -0.4350133 0.5384615 0.36363636
## B B 26 0.1078838 -0.3113845 -0.1538462 0.1538462 0.21212121
## C * C 105 0.4356846 0.4609036 0.7716180 0.1076923 0.33333333
## D D 28 0.1161826 -0.2706021 -0.1827586 0.2000000 0.09090909
## mid75 upper
## A 0.32692308 0.10344828
## B 0.03846154 0.00000000
## C 0.48076923 0.87931034
## D 0.15384615 0.01724138
##
## $i018
## correct key n rspP pBis discrim lower mid50
## A A 96 0.39834025 -0.4250997 -0.3180371 0.50769231 0.48484848
## B B 43 0.17842324 -0.2755281 -0.1925729 0.26153846 0.18181818
## C C 22 0.09128631 -0.2623175 -0.1366048 0.15384615 0.09090909
## D * D 80 0.33195021 0.4013261 0.6472149 0.07692308 0.24242424
## mid75 upper
## A 0.38461538 0.18965517
## B 0.19230769 0.06896552
## C 0.09615385 0.01724138
## D 0.32692308 0.72413793
##
## $i019
## correct key n rspP pBis discrim lower mid50
## A A 17 0.07053942 -0.14630133 -0.07135279 0.1230769 0.06060606
## B B 48 0.19917012 -0.23586731 -0.09469496 0.2153846 0.22727273
## C C 130 0.53941909 -0.22016521 -0.05570292 0.5384615 0.57575758
## D * D 46 0.19087137 0.09561872 0.22175066 0.1230769 0.13636364
## mid75 upper
## A 0.03846154 0.05172414
## B 0.23076923 0.12068966
## C 0.55769231 0.48275862
## D 0.17307692 0.34482759
##
## $i020
## correct key n rspP pBis discrim lower mid50
## A A 21 0.08713693 -0.2194626 -0.07506631 0.09230769 0.1515152
## B B 49 0.20331950 -0.1948577 -0.07374005 0.24615385 0.1969697
## C * C 105 0.43568465 0.1054536 0.32838196 0.29230769 0.3939394
## D D 66 0.27385892 -0.2973162 -0.17957560 0.36923077 0.2575758
## mid75 upper
## A 0.07692308 0.01724138
## B 0.19230769 0.17241379
## C 0.46153846 0.62068966
## D 0.26923077 0.18965517
However, CTT falls short in its sample dependency, which hampers test development in an aspect that needs an association with other population such as test equating and computerized adaptive testing. Some solutions are equipercentile equating for CTT test equating, and Thurstone absolute scaling for item-invariant measurement.