Federico Ferrero
This presentation simulates a realistic multiple-choice test, conducts a Classical Test Theory (CTT) item analysis, and interprets item-level and test-level statistics using R. The workflow mirrors what is typically done in operational assessment and evaluation contexts.
Clearing the workspace prevents interference from existing objects. The psych package is commonly used for CTT analyses, and setting a seed ensures reproducible results.
# Clean workspace
rm(list = ls())
# Load required library
library(psych)
## Warning: package 'psych' was built under R version 4.4.3
# Reproducibility
set.seed(123)
Two hundred examinees and fifteen items approximate a typical pilot or short operational test. Each item has four options, with A as the correct one. The latent ability variable (theta) is used only to generate realistic response behavior; in CTT, ability is not directly observed.
n_persons <- 200 # Number of test takers
n_items <- 15 # Number of test items
options <- c("A", "B", "C", "D")
correct_key <- rep("A", n_items)
# Simulated latent ability (not observed in practice)
theta <- rnorm(n_persons, mean = 0, sd = 1)
This function generates item responses such that higher-ability students are more likely to answer correctly, while lower-ability students choose among distractors. Distractors are intentionally uneven to reflect realistic misconception patterns. Although the simulation is inspired by IRT, the analysis remains purely CTT-based.
simulate_item <- function(theta, a_range = c(1.0, 1.5), b_range = c(-1, 1)) {
  sapply(theta, function(t) {
    a <- runif(1, a_range[1], a_range[2]) # discrimination
    b <- runif(1, b_range[1], b_range[2]) # difficulty
    # Probability of a correct response
    p <- 1 / (1 + exp(-a * (t - b)))
    correct <- rbinom(1, 1, prob = p)
    if (correct == 1) {
      "A"
    } else {
      distractors <- options[options != "A"]
      probs <- c(0.5, 0.3, 0.2) # Unequal distractor attractiveness
      sample(distractors, 1, prob = probs)
    }
  })
}
Item 5 is deliberately constructed to be easier and less discriminating. This allows us to demonstrate how CTT and distractor analyses help identify items that require revision.
mc_data <- sapply(1:n_items, function(i) {
  if (i == 5) {
    # Intentionally weaker item
    simulate_item(theta,
                  a_range = c(0.6, 0.9),
                  b_range = c(-1.5, -0.5))
  } else {
    simulate_item(theta)
  }
})
mc_data <- as.data.frame(mc_data)
colnames(mc_data) <- paste0("Item_", 1:n_items)
The table below shows the first five rows of the multiple-choice response dataset (mc_data); it contains the raw responses (A–D) to all fifteen items.
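A preview like this can be generated with head(); the knitr::kable() wrapper is assumed here only as an optional display helper for markdown output.
# Preview the first five respondents' raw responses
head(mc_data, 5)
# Optional: knitr::kable(head(mc_data, 5)) for a markdown-formatted table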
| Item_1 | Item_2 | Item_3 | Item_4 | Item_5 | Item_6 | Item_7 | Item_8 | Item_9 | Item_10 | Item_11 | Item_12 | Item_13 | Item_14 | Item_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | A | A | D | A | D | B | B | D | B | B | D | C | C | D |
| A | A | D | B | A | D | A | A | B | B | A | B | A | A | C |
| A | A | A | A | A | A | A | A | A | A | A | D | A | A | A |
| B | C | A | C | A | B | D | B | B | A | D | B | A | D | A |
| B | D | C | B | C | A | A | A | C | C | A | A | D | B | C |
Under CTT assumptions, responses are dichotomously scored: 1 for correct and 0 for incorrect. No modeling or partial credit is used.
scored_data <- as.data.frame(
  sapply(mc_data, function(x) as.numeric(x == "A"))
)
The table below shows the first five rows of the corresponding scored dataset (scored_data). This step verifies data structure and alignment before conducting the Classical Test Theory (CTT) item analysis: one row per student, one column per item, 0/1 scoring, and no missing data.
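A quick structural check is sketched below; the colSums(is.na(...)) line is an added verification step (not part of the original output) and should return zero for every item.
# Preview the first five scored rows and confirm there are no missing responses
head(scored_data, 5)
colSums(is.na(scored_data)) # should be 0 for every item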
| Item_1 | Item_2 | Item_3 | Item_4 | Item_5 | Item_6 | Item_7 | Item_8 | Item_9 | Item_10 | Item_11 | Item_12 | Item_13 | Item_14 | Item_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
# Reliability and item statistics from the psych package
alpha_results <- alpha(scored_data)
# Item difficulty: proportion of correct responses per item
item_difficulty <- colMeans(scored_data)
# Corrected item-total correlation (item excluded from the total)
item_discrimination <- alpha_results$item.stats$r.drop
# Cronbach's alpha if each item were deleted
alpha_if_deleted <- alpha_results$alpha.drop$raw_alpha
| Statistic | Definition / Description | Interpretation / Notes |
|---|---|---|
| Item difficulty (p-value) | Proportion of students who answered correctly | < .20 → too hard; .30–.70 → desirable; > .80 → too easy |
| Item discrimination (point-biserial, r.drop) | Correlation between item score and total test score (excluding the item itself) | < .20 → weak; .30–.39 → acceptable; ≥ .40 → strong |
| Cronbach’s alpha | Internal consistency of the test | Sensitive to number of items and item quality |
| Alpha if item deleted | Would reliability improve if this item were removed? | — |
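As an optional sanity check on the definitions above, the corrected item-total correlation (r.drop) for a single item can be reproduced by hand; the snippet below is an illustrative addition, and rest_score is a temporary variable introduced only for this check.
# Hand-check of r.drop for Item_1: correlate the item with the total of the
# remaining 14 items (should match alpha_results$item.stats$r.drop[1] above)
rest_score <- rowSums(scored_data[, -1])
cor(scored_data$Item_1, rest_score)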
ctt_summary <- data.frame(
  Item = colnames(scored_data),
  Difficulty = round(item_difficulty, 2),
  Discrimination = round(item_discrimination, 2),
  Alpha_if_deleted = round(alpha_if_deleted, 2)
)
print(ctt_summary)
## Item Difficulty Discrimination Alpha_if_deleted
## Item_1 Item_1 0.54 0.44 0.75
## Item_2 Item_2 0.54 0.34 0.76
## Item_3 Item_3 0.46 0.45 0.75
## Item_4 Item_4 0.47 0.41 0.76
## Item_5 Item_5 0.60 0.25 0.77
## Item_6 Item_6 0.50 0.38 0.76
## Item_7 Item_7 0.50 0.42 0.75
## Item_8 Item_8 0.54 0.37 0.76
## Item_9 Item_9 0.50 0.37 0.76
## Item_10 Item_10 0.45 0.30 0.76
## Item_11 Item_11 0.51 0.33 0.76
## Item_12 Item_12 0.52 0.38 0.76
## Item_13 Item_13 0.47 0.42 0.75
## Item_14 Item_14 0.49 0.38 0.76
## Item_15 Item_15 0.47 0.36 0.76
cat("Overall Cronbach's alpha:",
round(alpha_results$total$raw_alpha, 2))
## Overall Cronbach's alpha: 0.77
Analysis
Overall Cronbach’s alpha = 0.77, which indicates acceptable reliability for a short 15-item test.
Item difficulty: Values range from 0.45 to 0.60, which is well within the desirable range (.30–.70). This means most items are neither too hard nor too easy.
Item discrimination (r.drop): Most items have values between 0.30 and 0.45, which is acceptable to strong. Item 5, at 0.25, is noticeably lower than the others.
Alpha if item deleted: Values range from 0.75 to 0.77, very close to the overall alpha (0.77). Removing any single item would not substantially improve reliability.
Overall: The test is well-constructed with good internal consistency.
Item 5 is weaker in discrimination and slightly easier, so it could be reviewed or revised.
All other items are functioning well, contributing positively to reliability.
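The thresholds from the summary table can also be applied programmatically. The screening sketch below is an added illustration: it flags any item whose difficulty falls outside .30–.70 or whose discrimination is below .30; with the values printed above, only Item_5 meets the flagging condition.
# Flag items outside the desirable difficulty range or with weak discrimination
flagged_items <- subset(ctt_summary,
                        Difficulty < 0.30 | Difficulty > 0.70 | Discrimination < 0.30)
flagged_items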
Distractor analysis examines how each incorrect option behaves relative to the correct answer. The criteria used here are summarized below.

| Distractor type | Typical pattern |
|---|---|
| Functional distractor | Chosen by a reasonable proportion of examinees (≈5–30%); mean total score of those who choose it is below that of the correct option |
| Non-functional distractor | Rarely chosen (<5–10%); chosen almost exclusively by very low scorers |
# Raw Item 5 responses and total test scores for the distractor analysis
item5 <- mc_data$Item_5
total_score <- rowSums(scored_data)
# Option frequencies
prop.table(table(item5))
## item5
## A B C D
## 0.605 0.185 0.135 0.075
A (correct answer): 60.5% – This is a reasonable proportion, so the item is moderately easy.
B: 18.5%, C: 13.5%, D: 7.5% – These are the distractors.
B and C are functional distractors, chosen by a fair number of students.
D is a non-functional distractor, rarely chosen (<10%).
# Mean total score by option
aggregate(total_score,
          by = list(Option = item5),
          mean)
## Option x
## 1 A 8.661157
## 2 B 6.270270
## 3 C 6.037037
## 4 D 4.666667
A (correct answer): 8.66 – Students who answered correctly had the highest total scores, as expected.
B and C: Students choosing these distractors scored moderately lower (6.27, 6.04), indicating these distractors attract mid-level students.
D: Students choosing this distractor had the lowest scores (4.67), confirming it is non-functional; only the lowest-scoring students select it.
boxplot(total_score ~ item5,
        xlab = "Response Option",
        ylab = "Total Test Score",
        main = "Distractor Analysis – Item 5",
        col = c("lightblue", "pink", "lightgreen", "yellow"))
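The same two checks (option proportions and mean total score per option) generalize to every item. The helper below, distractor_summary(), is a hypothetical convenience function written for this walkthrough rather than part of any package; it reuses mc_data and total_score from the session.
# Hypothetical helper: option proportions and mean total score per option
distractor_summary <- function(item_name) {
  item <- mc_data[[item_name]]
  props <- prop.table(table(item))          # share of examinees per option
  means <- tapply(total_score, item, mean)  # mean total score per option
  data.frame(Option = names(props),
             Proportion = round(as.numeric(props), 3),
             Mean_total = round(as.numeric(means), 2))
}
distractor_summary("Item_5") # reproduces the Item 5 results above
# lapply(colnames(mc_data), distractor_summary) would cover all 15 items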