Federico Ferrero


1. Purpose of this tutorial

This tutorial simulates a realistic multiple-choice test, conducts a Classical Test Theory (CTT) item analysis, and interprets item-level and test-level statistics using R. The workflow mirrors what is typically done in operational assessment and evaluation contexts.


2. Setup

Clearing the workspace prevents interference from existing objects. The psych package is commonly used for CTT analyses, and setting a seed ensures reproducible results.

# Clean workspace
rm(list = ls())

# Load required library
library(psych)
## Warning: package 'psych' was built under R version 4.4.3
# Reproducibility
set.seed(123)
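
If psych is not yet installed, a one-time guarded install can precede the library() call:

# One-time install if the package is missing
if (!requireNamespace("psych", quietly = TRUE)) {
  install.packages("psych")
}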

3. Simulation parameters

Two hundred examinees and fifteen items approximate a typical pilot or short operational test. Each item has four options (A, B, C, D), with A keyed as the correct answer. The latent ability variable (theta) is used only to generate realistic response behavior; in CTT, ability is not directly observed.

n_persons <- 200     # Number of test takers
n_items   <- 15      # Number of test items

options <- c("A", "B", "C", "D")
correct_key <- rep("A", n_items)

# Simulated latent ability (not observed in practice)
theta <- rnorm(n_persons, mean = 0, sd = 1)
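
Since theta is simulated, we can inspect its distribution directly (in operational data it is unobserved):

# Quick look at the simulated abilities (approximately standard normal)
summary(theta)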

4. Simulating a multiple-choice item

This function generates item responses such that higher-ability students are more likely to answer correctly, while lower-ability students choose among the distractors. The probability of a correct response follows a two-parameter logistic (2PL) form, p = 1 / (1 + exp(-a(theta - b))), where a is the discrimination and b the difficulty. Distractors are intentionally uneven to reflect realistic misconception patterns. Although the simulation is inspired by IRT, the analysis remains purely CTT-based.

simulate_item <- function(theta, a_range = c(1.0, 1.5), b_range = c(-1, 1)) {
  sapply(theta, function(t) {
    a <- runif(1, a_range[1], a_range[2])  # discrimination (redrawn for each response)
    b <- runif(1, b_range[1], b_range[2])  # difficulty (redrawn for each response)
    
    # Probability of a correct response
    p <- 1 / (1 + exp(-a * (t - b)))
    correct <- rbinom(1, 1, prob = p)
    
    if (correct == 1) {
      "A"
    } else {
      distractors <- options[options != "A"]
      probs <- c(0.5, 0.3, 0.2)  # Unequal distractor attractiveness
      sample(distractors, 1, prob = probs)
    }
  })
}

5. Generating the test: Item 5 as the problematic item

Item 5 is deliberately constructed to be easier and less discriminating. This allows us to demonstrate how CTT and distractor analyses help identify items that require revision.

mc_data <- sapply(1:n_items, function(i) {
  if (i == 5) {
    # Intentionally weaker item
    simulate_item(theta,
                  a_range = c(0.6, 0.9),
                  b_range = c(-1.5, -0.5))
  } else {
    simulate_item(theta)
  }
})

mc_data <- as.data.frame(mc_data)
colnames(mc_data) <- paste0("Item_", 1:n_items)
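
Because ability is observable only in a simulation, we can run a sanity check that real data never allows: examinees with higher theta should answer more items correctly.

# Sanity check (possible only in simulation): proportion correct vs. ability
cor(theta, rowMeans(mc_data == "A"))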

The table below shows the first five rows of the multiple-choice response dataset (mc_data), which contains the raw responses (A–D).

First 5 rows of the simulated test data
Item_1 Item_2 Item_3 Item_4 Item_5 Item_6 Item_7 Item_8 Item_9 Item_10 Item_11 Item_12 Item_13 Item_14 Item_15
C A A D A D B B D B B D C C D
A A D B A D A A B B A B A A C
A A A A A A A A A A A D A A A
B C A C A B D B B A D B A D A
B D C B C A A A C C A A D B C

6. Scoring the test

Under CTT assumptions, responses are dichotomously scored: 1 for correct and 0 for incorrect. No modeling or partial credit is used.

scored_data <- as.data.frame(
  sapply(mc_data, function(x) as.numeric(x == "A"))
)

The table below shows the first five rows of the corresponding scored dataset (scored_data). This step verifies data structure and alignment before conducting the CTT item analysis: one row per student, one column per item, 0/1 scoring, and no missing data.

First 5 rows of the scored test data
Item_1 Item_2 Item_3 Item_4 Item_5 Item_6 Item_7 Item_8 Item_9 Item_10 Item_11 Item_12 Item_13 Item_14 Item_15
0 1 1 0 1 0 0 0 0 0 0 0 0 0 0
1 1 0 0 1 0 1 1 0 0 1 0 1 1 0
1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
0 0 1 0 1 0 0 0 0 1 0 0 1 0 1
0 0 0 0 0 1 1 1 0 0 1 1 0 0 0
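
A minimal programmatic check of these points (dimensions, 0/1 values, no missing data):

# Verify structure: 200 rows (students) by 15 columns (items)
dim(scored_data)

# All values are 0/1 and nothing is missing
stopifnot(all(unlist(scored_data) %in% c(0, 1)))
sum(is.na(scored_data))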

7. Reliability and item statistics

The alpha() function from psych computes Cronbach's alpha together with item-level statistics in a single call; colMeans() gives the classical difficulty (proportion correct) for each item.

alpha_results <- alpha(scored_data)

item_difficulty     <- colMeans(scored_data)
item_discrimination <- alpha_results$item.stats$r.drop
alpha_if_deleted    <- alpha_results$alpha.drop$raw_alpha

The key statistics and common rules of thumb:

- Item difficulty (p-value): the proportion of students who answered the item correctly. Interpretation: < .20 → too hard; .30–.70 → desirable; > .80 → too easy.
- Item discrimination (point-biserial, r.drop): the correlation between the item score and the total test score, excluding the item itself. Interpretation: < .20 → weak; .30–.39 → acceptable; ≥ .40 → strong.
- Cronbach's alpha: the internal consistency of the test; sensitive to the number of items and to item quality.
- Alpha if item deleted: asks whether reliability would improve if the item were removed; an increase after deletion flags a weak item.
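
These rules of thumb can be encoded directly. The helper below is a hypothetical convenience function (not part of psych) that applies the cut-points from the list above; values between the named cut-points are simply treated as acceptable:

# Hypothetical helper: label each item using the rule-of-thumb cut-points
flag_item <- function(p, r) {
  diff_flag <- ifelse(p < .20, "too hard",
                      ifelse(p > .80, "too easy", "acceptable"))
  disc_flag <- ifelse(r < .20, "weak",
                      ifelse(r >= .40, "strong", "acceptable"))
  paste(diff_flag, disc_flag, sep = " / ")
}

flag_item(item_difficulty, item_discrimination)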

8. CTT summary table

ctt_summary <- data.frame(
  Item = colnames(scored_data),
  Difficulty = round(item_difficulty, 2),
  Discrimination = round(item_discrimination, 2),
  Alpha_if_deleted = round(alpha_if_deleted, 2)
)

print(ctt_summary)
##            Item Difficulty Discrimination Alpha_if_deleted
## Item_1   Item_1       0.54           0.44             0.75
## Item_2   Item_2       0.54           0.34             0.76
## Item_3   Item_3       0.46           0.45             0.75
## Item_4   Item_4       0.47           0.41             0.76
## Item_5   Item_5       0.60           0.25             0.77
## Item_6   Item_6       0.50           0.38             0.76
## Item_7   Item_7       0.50           0.42             0.75
## Item_8   Item_8       0.54           0.37             0.76
## Item_9   Item_9       0.50           0.37             0.76
## Item_10 Item_10       0.45           0.30             0.76
## Item_11 Item_11       0.51           0.33             0.76
## Item_12 Item_12       0.52           0.38             0.76
## Item_13 Item_13       0.47           0.42             0.75
## Item_14 Item_14       0.49           0.38             0.76
## Item_15 Item_15       0.47           0.36             0.76
cat("Overall Cronbach's alpha:",
    round(alpha_results$total$raw_alpha, 2))
## Overall Cronbach's alpha: 0.77
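
A useful test-level companion to alpha is the standard error of measurement (SEM), which expresses unreliability on the raw-score scale: SEM = SD(total score) × sqrt(1 − alpha).

# Standard error of measurement (CTT)
sd(rowSums(scored_data)) * sqrt(1 - alpha_results$total$raw_alpha)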

Analysis

Overall: with alpha = .77, the test shows good internal consistency and is well constructed.

Item 5 shows the weakest discrimination (r.drop = .25) and is slightly easier than the rest (p = .60), so it should be reviewed or revised.

All other items are functioning well and contribute positively to reliability.

9. Distractor analysis (Item 5)

A functional distractor attracts a meaningful share of examinees (a common rule of thumb is at least 5%) and draws mainly lower-scoring examinees. A non-functional distractor is chosen by almost nobody, or attracts examinees regardless of ability, and is a candidate for rewriting. We examine Item 5's options below.

item5 <- mc_data$Item_5
total_score <- rowSums(scored_data)

# Option frequencies
prop.table(table(item5))
## item5
##     A     B     C     D 
## 0.605 0.185 0.135 0.075
# Mean total score by option
aggregate(total_score,
          by = list(Option = item5),
          mean)
##   Option        x
## 1      A 8.661157
## 2      B 6.270270
## 3      C 6.037037
## 4      D 4.666667
boxplot(total_score ~ item5,
        xlab = "Response Option",
        ylab = "Total Test Score",
        main = "Distractor Analysis – Item 5",
        col = c("lightblue","pink","lightgreen","yellow"))

Examinees who chose A clearly outscore those who chose any distractor (8.66 vs. 6.27, 6.04, and 4.67), and every distractor attracts more than 5% of responses, so all three distractors are functional. Item 5's weakness therefore lies in its low discrimination and relative easiness, not in broken options.
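
To condense the distractor evidence into one table, here is a minimal sketch; the 5% functionality cut-point is a common rule of thumb rather than a fixed standard, and the Functional flag is meaningful for the distractors only:

# Option-level summary: share choosing each option, its correlation with
# the total score, and a 5%-rule functionality flag (distractors only)
option_props <- prop.table(table(item5))
option_disc  <- sapply(names(option_props),
                       function(opt) cor(as.numeric(item5 == opt), total_score))
data.frame(Option         = names(option_props),
           Proportion     = round(as.numeric(option_props), 3),
           Point_biserial = round(option_disc, 2),
           Functional     = as.numeric(option_props) >= 0.05)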