Project History and Acknowledgments

Originally conceived during advanced doctoral training in Psychometric Methods at Ohio University (2013), this project grew out of intensive coursework in Classical Test Theory (CTT) and Questionnaire Development. The early work laid a methodological foundation for reliability, validity, and measurement modeling, while also building a comprehensive framework for educational and psychological assessment. From the beginning, the project was structured not merely as an applied exercise but as a pedagogical exploration of core measurement concepts, designed to make the underlying statistical and psychometric reasoning transparent.

A central design principle was the avoidance of black-box applications. Wherever possible, built-in functions were set aside in favor of custom routines, allowing each procedure—whether reliability estimation or item analysis—to be constructed step by step. This approach fostered a hands-on progression through the concepts, highlighting the logic behind calculations rather than merely their outputs. In doing so, the project emphasized deeper conceptual understanding—a value essential for both teaching and research.

Over time, the project expanded through technological augmentation, most notably the integration of artificial intelligence and modern computational techniques. These innovations enhanced the capacity for simulation studies, adaptive procedures, and advanced diagnostics, while still preserving the psychometric integrity of the original framework. By combining classical measurement theory with contemporary computational advances, the work demonstrates how rigorous, transparent methodology can evolve responsibly without sacrificing its foundational principles.


1 Questionnaire Development and Likert Scales

When designing questionnaires or surveys—particularly those employing Likert scales—it is critical to ensure that the instrument is valid, reliable, and capable of producing meaningful data. Math anxiety, for example, is a psychological construct characterized by fear, tension, or apprehension when engaging in mathematical tasks. Accurately measuring such a construct requires a carefully developed Likert-type questionnaire grounded in sound psychometric principles. The following outlines a step-by-step process for developing a robust scale to assess math anxiety.


1.1 Questionnaire Development Steps

  • 1. Define the Construct: Clearly specify the psychological or behavioral construct of interest (e.g., math anxiety, job satisfaction). (DeVellis, 2017)
  • 2. Generate Items: Create a pool of items that comprehensively represent the construct’s dimensions, ensuring content validity, clarity, and balance. (Carmines & Zeller, 1979)
  • 3. Pilot Test the Questionnaire: Administer the preliminary items to a small, representative sample to evaluate clarity, response patterns, and initial reliability.
  • 4. Conduct Item Analysis: Examine item performance using statistics such as item–total correlations, response distributions, and variance. Revise or remove items with poor psychometric properties.
  • 5. Assess Reliability: Evaluate internal consistency (e.g., Cronbach’s alpha, with α ≥ 0.70 generally acceptable) and, where appropriate, test–retest reliability or split-half methods. (Cronbach, 1951) A minimal R sketch of steps 4–5 follows this list.
  • 6. Validate the Scale: Use exploratory or confirmatory factor analysis and other validity checks to confirm the dimensional structure and ensure the questionnaire measures the intended construct.
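
To make steps 4 and 5 concrete, the sketch below runs a minimal item analysis and reliability check on simulated pilot responses. It is a sketch only: the data are generated purely for demonstration, and object names such as pilot and item_total_r are illustrative.

# Minimal sketch of item analysis (step 4) and reliability (step 5)
# on simulated pilot data; all object names are illustrative.
set.seed(1)
n_pilot <- 50
theta   <- rnorm(n_pilot)                              # one latent trait
pilot   <- as.data.frame(sapply(1:6, function(j) {
  z <- 0.8 * theta + rnorm(n_pilot, sd = 0.6)          # item = trait + noise
  as.integer(cut(z, c(-Inf, -1, -0.3, 0.3, 1, Inf)))   # discretize to 1-5
}))
names(pilot) <- paste0("item", 1:6)

# Step 4: corrected item-total correlations (item vs. mean of the others)
item_total_r <- sapply(seq_along(pilot), function(j)
  cor(pilot[[j]], rowMeans(pilot[-j])))

# Step 5: Cronbach's alpha from item variances and total-score variance
k     <- ncol(pilot)
alpha <- (k / (k - 1)) * (1 - sum(apply(pilot, 2, var)) / var(rowSums(pilot)))

round(item_total_r, 2)
round(alpha, 2)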

1.2 Likert Scales

A Likert scale is a widely used survey instrument designed to measure attitudes, beliefs, or perceptions. Rather than restricting respondents to simple yes/no answers, it presents a series of statements and asks individuals to indicate the extent to which they agree or disagree. This approach captures both the direction (positive vs. negative) and the intensity (weak vs. strong) of a respondent’s attitude. When responses are aggregated across items, Likert scales provide reliable measures of complex constructs such as satisfaction, motivation, or anxiety.

The method was first introduced by Rensis Likert (1932) in his doctoral dissertation at Columbia University, A Technique for the Measurement of Attitudes. Likert’s innovation was to capture degrees of agreement beyond binary responses. His scale quickly became foundational in social science research, offering a practical and statistically analyzable method for studying complex constructs such as prejudice, job satisfaction, and political attitudes. Over time, the Likert format was adapted into various versions (e.g., 4-point, 5-point, 7-point, and 10-point scales) and remains a cornerstone of survey methodology, psychometrics, and educational measurement.

The scale typically follows an ordered response format, most often a 5-point or 7-point continuum.


Example: 5-Point Likert Scale

  • [1] Strongly Disagree – Complete rejection of the statement
  • [2] Disagree – Partial or moderate rejection
  • [3] Neutral – Neither agreement nor disagreement / undecided
  • [4] Agree – Partial or moderate endorsement
  • [5] Strongly Agree – Complete endorsement of the statement

Example: 7-Point Likert Scale

  • [1] Strongly Disagree – Complete rejection of the statement
  • [2] Disagree – Moderate rejection
  • [3] Slightly Disagree – Mild rejection ← added nuance
  • [4] Neutral – Neither agreement nor disagreement / undecided
  • [5] Slightly Agree – Mild endorsement ← added nuance
  • [6] Agree – Moderate endorsement
  • [7] Strongly Agree – Complete endorsement of the statement

Note.

  • A 5-point scale is often preferred for its simplicity and ease of use, especially in general surveys or when respondents may have limited time or literacy.
  • A 7-point scale provides greater sensitivity and nuance, allowing for finer distinctions in attitudes or perceptions, and is often used in research where subtle differences are important.
  • Even-numbered scales (e.g., 4 or 6 points): Force a choice by removing a neutral option.
Design considerations and recommendations:

  • Number of points: 5–7 points is optimal (balances sensitivity and reliability)
  • Labeling: Anchor all points, not just the endpoints
  • Neutral midpoint: Include one unless a forced choice is required
  • Direction consistency: Keep positive/negative orientation uniform across items

To improve validity and reduce response bias, researchers often include reverse-coded items—statements worded in the opposite direction (e.g., “I feel confident solving math problems” vs. “I often struggle with math problems”). This ensures that respondents engage more thoughtfully with each item rather than responding automatically in the same direction.


Example of Polarity / Reverse Coding

  • Positive wording: “I feel confident solving math problems.”
  • Reverse wording: “I often struggle to solve math problems.”
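
In code, reverse scoring on a 5-point scale is a one-line arithmetic flip. Subtracting each raw response from 6 (one more than the maximum category) maps 1 to 5 and 2 to 4 while leaving the neutral midpoint unchanged; a minimal sketch:

# Reverse-score a 5-point Likert item: 6 - x swaps 1/5 and 2/4,
# and leaves the neutral midpoint (3) unchanged.
x <- c(1, 2, 3, 4, 5)
6 - x
## [1] 5 4 3 2 1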

Distinguishing Items from Scales

It is important to distinguish between Likert-type items and a full Likert scale:

  • A Likert-type item refers to a single statement with an ordered response format (e.g., one question about agreement).
  • A Likert scale—more strictly defined—is a multi-item instrument that aggregates responses across several such items to measure a broader underlying construct.

While this distinction is often blurred in everyday usage, it is significant in research methodology: a single item provides only limited information, whereas a true Likert scale yields more robust and reliable measurement. (Carmines & Zeller, 1979)


1.3 Essential Qualities of Effective Likert Item Wording

These principles build upon one another to ensure that items are valid, interpretable, and reliable:

  • Clear (A)
    • The question should be immediately understandable
    • Avoid jargon or complex syntax
      • Bad: "Do you concur with the pedagogical methodologies?"
      • Good: "Do you agree with the teaching methods?"
  • Specific (B)
    • Focuses on one concrete concept
    • Avoid vague or abstract phrasing
      • Bad: "Do you like the environment?"
      • Good: "Is your classroom well-lit and quiet?"
  • Unambiguous (C)
    • Eliminates multiple interpretations
    • Avoid double-barreled questions
      • Bad: "The instructor was knowledgeable and approachable"
      • Good: "The instructor demonstrated subject knowledge"
  • Single-dimensional (D)
    • Measures exactly one attitude or opinion
    • Critical for reliable scaling
      • Bad: "The course was useful and well-organized"
      • Good: "The course content was practically useful"

1.3.1 Why This Flow Matters

  • Directional: Each principle depends on the one below it, forming a hierarchy.
  • Diagnostic: Makes it easy to spot where an item breaks down (e.g., clear but not specific).
  • Design Guide: Think of these as a ladder built from the bottom up:

1.3.2 How to Use the Ladder

  1. Start at the bottom (D): Define one construct to measure
  2. Move up (C): Phrase it so it can’t be misinterpreted
  3. Refine (B): Make it concrete and focused
  4. Finish (A): Simplify wording so it’s clear to respondents

Practical Example

Developing the item: "I feel safe on campus at night"

  • Single-dimensional: Focuses only on safety perception
  • Unambiguous: Specifies “at night” timeframe
  • Specific: References “on campus” location
  • Clear: Uses simple, direct language

This sequence ensures each question in your Likert scale measures what you intend—without confounding factors.


1.4 Components of Math Anxiety

Math anxiety represents a robust and self-sustaining psychological ecosystem—a complex, synergistic network of cognitive, affective, physiological, and behavioral components that collectively lock an individual into a debilitating cycle of avoidance and impaired performance (Ashcraft & Krause, 2007). Rather than constituting isolated symptoms, this system demonstrates deeply interconnected dynamics where each component simultaneously fuels and is fueled by others through feedback loops remarkably resistant to intervention.

The pathological cycle typically originates with maladaptive appraisal processes that frame mathematics as a personal threat. These appraisals often develop through cultural transmission mechanisms, including parental modeling of mathematical avoidance (Maloney & Beilock, 2012) or immersion in fixed mindset educational cultures that equate mathematical struggle with inherent inability (Dweck, 2008). This initial threat appraisal activates a cascade of neurophysiological responses beginning with heightened amygdala activation, which amplifies catastrophic thought patterns and magnifies perceived potential for failure (Lyons & Beilock, 2012). This neural activation subsequently drives hyperactivity within the brain's default mode network, facilitating the self-referential and intrusive negative thinking characteristic of mathematical anxiety (Ashcraft & Krause, 2007; Buckner et al., 2008).

Concurrently, the hypothalamic-pituitary-adrenal (HPA) axis activates, elevating cortisol levels by 15-30% and significantly impairing hippocampal function—thereby crippling the memory retrieval processes essential for mathematical problem-solving (Lupien et al., 2007; Mattarella-Micke et al., 2011). Parallel activation of the sympathetic nervous system initiates a comprehensive fight-or-flight response, producing measurable physical symptoms including physiological tremor (0.5–3Hz) and up to 25% reduction in salivary flow (Lyons & Beilock, 2012). This biological cascade consumes 70-80% of prefrontal cortex resources typically allocated for executive functions, inducing working memory overload and creating profound attentional control deficits that render focused mathematical thought nearly impossible (Eysenck, 2007; Ramirez et al., 2013).

The cognitive-physiological hijacking establishes a subjective experience of mathematical failure that feels both inevitable and insurmountable. This experience generates intense affective responses, particularly anticipatory anxiety that manifests up to 24 hours before evaluative events and demonstrates skin conductance responses three times above baseline levels (Pekrun & Linnenbrink-Garcia, 2012). This dual-process emotion dysregulation (Gross, 2015) becomes compounded by frustration responses when cognitive load exceeds personal thresholds, frequently triggering complete task disengagement (Goetz et al., 2013). The affective experience often culminates in mathematically-specific shame—a social-evaluative pain associated with anterior cingulate cortex activation and response magnitudes twice as strong in female populations (Turner, 2002). This shame response is powered by anticipatory anxiety circuitry that renders the fear of future mathematical failure as potent as the experience itself (Grupe & Nitschke, 2013).

The convergence of cognitive and affective distress inevitably manifests through observable behavioral outcomes. To escape the aversive psychological state, individuals engage in avoidance patterns maintained through operant conditioning mechanisms, wherein the immediate relief provided by mathematical avoidance negatively reinforces the escape behavior (Núñez-Peña & Suárez-Pellicioni, 2016a; Skinner, 1938). These patterns yield STEM attrition rates three times higher among math-anxious students compared to their non-anxious peers. Additional safety behaviors emerge, including dependence on calculators for basic arithmetic operations—a strategy that provides short-term anxiety reduction but maintains long-term mathematical anxiety by preventing the development of computational fluency and numerical confidence (Ashcraft & Moore, 2009; Ferster & Skinner, 1957). Procrastination represents another common safety behavior, leading to last-minute cramming that increases cognitive load by approximately 40% during actual examination settings, thereby ensuring performance remains substantially below actual capability (Salkovskis, 1991; Solomon & Rothblum, 1984).

These avoidance behaviors systematically prevent the mastery experiences necessary for building accurate mathematical self-efficacy, resulting in students underestimating their actual abilities by 20-30% compared to objective performance metrics (Pajares & Miller, 1994). This motivational collapse originates in social cognitive processes where mathematical self-concept forms through comparative social processes and interpreted experiences (Bandura, 1986; Marsh, 1990). Vicarious learning mechanisms play a particularly powerful role, with observations of peer mathematical struggle reducing self-efficacy twice as effectively as personal failure experiences (Schunk, 1987). These processes converge with fixed mindset beliefs that promote attributions of mathematical struggle to stable, internal ability deficits rather than modifiable effort or strategy factors, ultimately producing 50% reductions in persistence following mathematical difficulty (Dweck, 2008; Weiner, 1985).

The motivational deterioration becomes further cemented through impaired value attribution processes. Utility value deficits emerge when approximately 60% of students cannot identify legitimate real-world applications for mathematical content, effectively dismantling their rationale for engagement according to Expectancy-Value Framework principles (Eccles, 1983; Eccles & Wigfield, 2002). These deficits can escalate into comprehensive identity conflicts wherein beliefs that "math isn't for people like me" predict 35% engagement reductions because mathematical tasks feel fundamentally incongruent with self-concept (Oyserman, 2007; Wigfield & Meece, 1988). The situation becomes compounded by profound cost-benefit miscalibrations where students overestimate required mathematical effort by 200-300%, making avoidance appear as the only rational behavioral choice (Flake et al., 2015; Wigfield et al., 2016).

This internal psychological system becomes amplified and sustained through external social and evaluative contexts. Social comparison processes—particularly upward comparisons to high-performing peers—can elevate cortisol levels by 23% and reduce mathematical accuracy by 19%, creating powerful social identity threats (Festinger, 1954; Huguet et al., 2009; Tajfel & Turner, 1979). The testing environment itself functions as a situational demand characteristic, with time pressure alone producing 50% performance declines among vulnerable individuals (Murray, 1938; Sarason, 1984). Pedagogical approach also plays a critical moderating role, as inquiry-based instructional methods can reduce mathematical anxiety by 0.8 standard deviations compared to traditional lecture-based formats by fundamentally altering cognitive-affective appraisals of the subject matter (Lazarus, 1991; Stodolsky, 1985).

Although contextual factors including time constraints, instructional methodologies, and assessment formats demonstrably influence mathematical experiences, they are more accurately conceptualized as situational antecedents or moderators of mathematical anxiety rather than definitional features of the construct itself. The core dimensions of mathematical anxiety—cognitive worry, affective distress, and physiological arousal—capture the essential phenomenon as consistently defined throughout the research literature (Ashcraft & Ridley, 2005; Richardson & Suinn, 1972). Contextual items frequently create construct validity concerns by conflating anxiety with environmental conditions rather than measuring stable individual differences. Empirical research confirms that while situational factors can exacerbate or buffer anxious responses, they do not represent enduring latent traits (Hopko et al., 2003; Tobias, 1986). Including such items risks construct dilution and obscures the fundamental emotional and cognitive processes defining mathematical anxiety. Consequently, contextual influences appear more appropriately modeled as predictors or moderators rather than constitutive components of the mathematical anxiety construct, justifying their exclusion from our measurement approach.


1.5 The Comprehensive Math Anxiety Assessment Scale (CMAAS)

This framework organizes math anxiety into seven interconnected domains—cognitive worry, affective distress, physiological arousal, behavioral avoidance/approach patterns, value beliefs, self-efficacy beliefs, and social–cultural processes. These domains capture the stable, trait-like features of math anxiety that endure across contexts, distinguishing them from situational antecedents (e.g., time pressure, pedagogy) that merely moderate expression. The CMAAS was developed to operationalize these domains with balanced adaptive (–) and maladaptive (+) indicators. By incorporating both risk and resilience factors, the CMAAS provides a theoretically grounded, psychometrically robust tool for assessing math anxiety as an integrated system of cognitive, emotional, physiological, motivational, and social components.


Key Features of a Math Anxiety Scale:

Self-Report Format

The scale relies on self-report, where individuals rate their agreement with statements describing math-related thoughts, feelings, and behaviors. This approach captures the subjective and experiential nature of math anxiety, providing a direct window into respondents’ internal states in a standardized format.

Multi-Dimensional Structure

Math anxiety is assessed across multiple interrelated domains to reflect its complexity. Cognitive aspects include negative thoughts and interference with working memory; affective components encompass worry, fear, and emotional distress; physiological responses involve bodily arousal such as sweating, tension, or elevated heart rate; and behavioral indicators reflect avoidance and procrastination around math tasks. Evaluating these dimensions simultaneously yields a more comprehensive understanding of how math anxiety operates across mental, emotional, physical, and behavioral levels.

Likert-Scale Measurement

Responses are collected on a Likert-type scale, most often ranging from 1 = Strongly Disagree to 5 = Strongly Agree. This scaling method quantifies subjective experience in an accessible way, ensures comparability across individuals, and supports reliable scoring procedures suitable for both descriptive summaries and inferential analyses.

Polarity of Items

Items are deliberately written in both positive (e.g., “I feel confident solving math problems”) and negative (e.g., “I avoid math whenever possible”) directions. Incorporating polarity reduces acquiescence bias, the tendency to agree with items regardless of content, and promotes deeper cognitive engagement with each statement. This enhances both the validity and reliability of responses.

Quantifiable Outcomes

Individual item scores are aggregated into total and subscale indices, with higher scores reflecting greater levels of math anxiety. These outcomes serve as standardized indicators, allowing comparisons between respondents and facilitating research on the links between math anxiety, academic achievement, and broader psychological well-being.
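
As a brief sketch of this scoring logic, the snippet below builds a total score and two subscale means from a toy data frame of already reverse-scored responses. The column names (C1–C3 for Cognitive, A1–A3 for Affective) mirror the naming used in the simulation of Section 1.6; everything here is illustrative.

# Toy example: aggregate item responses into total and subscale indices.
scores <- data.frame(
  C1 = c(4, 2), C2 = c(5, 1), C3 = c(4, 2),   # Cognitive items
  A1 = c(3, 2), A2 = c(4, 1), A3 = c(5, 2)    # Affective items
)

total_score     <- rowSums(scores)                               # overall index
cognitive_score <- rowMeans(scores[, grep("^C", names(scores))]) # subscale mean
affective_score <- rowMeans(scores[, grep("^A", names(scores))]) # subscale mean

cbind(total_score, cognitive_score, affective_score)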

Alternative Forms

Each item was developed with a parallel alternative form (e.g., 1a and 1b), providing two semantically equivalent but uniquely worded statements that assess the same underlying construct. This design serves several important purposes. First, it allows for the creation of equivalent test versions that can be rotated across administrations, thereby minimizing practice effects and memory bias in longitudinal studies. Second, it supports robust test–retest reliability evaluations by ensuring that repeated administrations measure the same psychological content without simply replicating identical wording. Third, the inclusion of alternative forms increases item-level flexibility for researchers, allowing them to tailor instruments to specific study designs (e.g., experimental manipulations, classroom surveys, clinical assessments) while maintaining consistent coverage of the theoretical domains. Finally, this approach enhances construct validity by enabling analyses of item equivalence, thereby confirming that math anxiety is captured consistently regardless of minor wording differences.


Comprehensive Math Anxiety Assessment Scale (CMAAS)
Each entry lists the paired item wordings (forms a and b), the item’s polarity, its theoretical basis, and its underlying mechanism.

Cognitive (Negative Thought Patterns/Working Memory)

1a. My mind goes blank when I try to start a math problem
1b. I struggle to recall even simple steps when beginning a math problem

  • Polarity: (+)
  • Theoretical basis: Working memory overload — interference of anxiety with limited-capacity storage during math tasks (Ramirez et al., 2013)
  • Underlying mechanism: Prefrontal cortex resource depletion — anxiety-driven diversion of attentional control, reducing effective executive function (Eysenck et al., 2012)

2a. I feel confident solving math problems without panicking
2b. I can stay focused on math problems without becoming overwhelmed

  • Polarity: (–)
  • Theoretical basis: Cognitive interference reduction — disruption of intrusive thoughts during problem solving (Ashcraft & Kirk, 2001)
  • Underlying mechanism: Default mode network regulation — suppression of self-referential, anxiety-linked rumination during task engagement (Raichle et al., 2001)

3a. I look forward to learning new math concepts
3b. I feel eager to explore new math ideas when they are introduced

  • Polarity: (–)
  • Theoretical basis: Anticipatory anxiety decrease — reduced tendency to catastrophize about upcoming challenges (Grupe & Nitschke, 2013)
  • Underlying mechanism: Amygdala–prefrontal connectivity — strengthened regulation of emotional reactivity during learning (Phelps & LeDoux, 2005)

Affective (Emotional Responses)

4a. I enjoy solving challenging math problems
4b. Solving difficult math questions gives me a sense of satisfaction

  • Polarity: (–)
  • Theoretical basis: Frustration tolerance — capacity to sustain positive affect when faced with difficulty (Goetz et al., 2013)
  • Underlying mechanism: Cognitive reappraisal capacity — ability to reframe challenging tasks as opportunities for growth (Gross, 2002)

5a. I feel a sense of dread the day before a math class
5b. I feel anxious the night before attending a math class

  • Polarity: (+)
  • Theoretical basis: Emotional reactivity — heightened negative affect in anticipation of math-related tasks (Pekrun, 2006)
  • Underlying mechanism: Limbic system reactivity — overactivation of fear- and threat-processing regions (LeDoux, 1996)

6a. I feel proud when I make progress on a hard math problem
6b. Making progress on a tough math task makes me feel accomplished

  • Polarity: (–)
  • Theoretical basis: Achievement emotions — enjoyment/pride — positive affect tied to progress and mastery (Pekrun & Linnenbrink-Garcia, 2017)
  • Underlying mechanism: Cognitive reappraisal — reframing difficulty as an opportunity for growth (Pekrun, 2006)

Physiological (Bodily Reactions)

7a. My heart races when I'm called on in math class
7b. My hands start to shake when I’m unexpectedly asked a math question

  • Polarity: (+)
  • Theoretical basis: Sympathetic arousal — heightened physiological stress response in evaluative math contexts (Lyons & Beilock, 2012)
  • Underlying mechanism: Sympathetic nervous system activation — fight-or-flight cardiovascular response to perceived threat (Sapolsky, 2004)

8a. I remain physically calm during math tests
8b. I can keep my body relaxed during math tests

  • Polarity: (–)
  • Theoretical basis: Physiological regulation — ability to sustain bodily calm through autonomic control (Porges, 2007)
  • Underlying mechanism: Parasympathetic tone preservation — vagal mechanisms supporting stress resilience during evaluation (Thayer et al., 2012)

9a. I have trouble sleeping the night before a math exam
9b. I find it hard to rest well the evening before a math exam

  • Polarity: (+)
  • Theoretical basis: Somatic tension — heightened physical arousal interfering with restorative sleep (Owens et al., 2012)
  • Underlying mechanism: Neuromuscular tension feedback — sustained muscle activation signaling stress and perpetuating anxiety (Blascovich & Mendes, 2000)

Behavioral (Approach–Avoidance)

10a. I avoid taking math classes whenever possible
10b. I steer clear of electives or activities that involve a lot of math

  • Polarity: (+)
  • Theoretical basis: Avoidance behavior — tendency to disengage from anxiety-provoking contexts (Núñez-Peña & Suárez-Pellicioni, 2016b)
  • Underlying mechanism: Negative reinforcement cycle — avoidance temporarily reduces distress, reinforcing long-term withdrawal (Skinner, 1953)

11a. I actively seek math challenges to improve
11b. I take on extra math practice to challenge myself

  • Polarity: (–)
  • Theoretical basis: Approach motivation — orientation toward mastery goals and engagement with challenge (Elliot & Church, 2006)
  • Underlying mechanism: Competence motivation system — intrinsic drive to build skills and efficacy through practice (White, 1959)

12a. I put off doing my math homework until the last minute
12b. I delay starting my math assignments until the deadline is close

  • Polarity: (+)
  • Theoretical basis: Procrastination cycle — habitual delay of task initiation in response to anxiety or aversion (Solomon & Rothblum, 1984)
  • Underlying mechanism: Temporal discounting bias — preference for short-term avoidance over long-term academic benefit (Ainslie, 1975)

Value Beliefs (Utility/Attainment)

13a. I don't see how I'll ever use this math in real life
13b. I doubt that the math I’m learning will be useful in my future

  • Polarity: (+)
  • Theoretical basis: Low utility value perception — belief that math lacks relevance to personal goals (Eccles, 1983)
  • Underlying mechanism: Constricted future-time perspective — reduced motivation from failing to connect present effort with future benefit (Zimbardo & Boyd, 1999)

14a. Being good at math is not an important part of who I am
14b. Doing well in math is not central to how I see myself

  • Polarity: (+)
  • Theoretical basis: Low attainment value concern — math achievement not integrated into self-concept or identity (Wigfield & Meece, 1988)
  • Underlying mechanism: Possible selves framework — limited incorporation of math ability into envisioned future selves (Markus & Nurius, 1986)

15a. Math requires more time and effort than it's worth
15b. Math seems to demand more effort than the results are worth

  • Polarity: (+)
  • Theoretical basis: Performance impact — perception that cognitive/temporal costs of math outweigh potential benefits (Ashcraft, 2002)
  • Underlying mechanism: Academic self-concept formation — diminished integration of math ability into overall self-concept (Shavelson et al., 1976)

Self-Efficacy (Confidence/Persistence)

16a. I believe I can succeed in math with effort
16b. With persistence, I believe I can improve my math performance

  • Polarity: (–)
  • Theoretical basis: Growth mindset — belief that ability develops through sustained effort (Dweck, 2008)
  • Underlying mechanism: Attributional style — attributing success to controllable, internal causes like persistence (Weiner, 1985)

17a. Difficult math problems motivate me to learn more
17b. Challenging math exercises push me to strengthen my skills

  • Polarity: (–)
  • Theoretical basis: Mastery orientation — focus on developing competence through challenge (Ames, 1992)
  • Underlying mechanism: Challenge–threat appraisal — perceiving difficulty as an opportunity for growth rather than as a threat (Blascovich et al., 2001)

18a. I give up easily when math gets difficult
18b. I often feel there is no point in trying when math gets hard

  • Polarity: (+)
  • Theoretical basis: Learned helplessness — expectancy of failure that undermines persistence (Abramson et al., 1978)
  • Underlying mechanism: Effort withdrawal threshold — low tolerance for sustained engagement before disengagement (Eccles & Wigfield, 1998)

Social–Cultural (Comparative/Help-Seeking)

19a. I see asking for help as a useful strategy when I'm stuck on a math concept
19b. I am comfortable asking others for help when I don’t understand a math concept

  • Polarity: (–)
  • Theoretical basis: Social safety — perception that classroom help-seeking will not lead to embarrassment or judgment (Turner, 2002)
  • Underlying mechanism: Help-seeking as coping — adaptive use of social resources to regulate difficulty and promote learning (Karabenick, 2003)

20a. Praise from my teacher on my math work boosts my confidence
20b. Encouragement from my teacher makes me feel more capable in math

  • Polarity: (–)
  • Theoretical basis: Positive reinforcement — verbal persuasion as a source of self-efficacy (Bandura, 1997)
  • Underlying mechanism: Verbal persuasion effects — efficacy calibration via credible social feedback (Bandura, 1997)

21a. Watching classmates solve problems quickly makes me doubt my own abilities
21b. Seeing classmates solve math faster makes me question my own ability

  • Polarity: (+)
  • Theoretical basis: Social comparison — self-evaluations shaped by observing peers’ performance (Festinger, 1954)
  • Underlying mechanism: Upward social comparison — comparing to higher-performing peers in ways that undermine confidence (Festinger, 1954)

This Shiny app builds an interactive questionnaire for the Comprehensive Math Anxiety Assessment Scale (CMAAS). The user interface presents each of the 21 items as a row in a grid. The overall design allows participants to complete the survey interactively, receive immediate feedback in the form of a summary table, and export their responses for further analysis.

Important Note: This is a static preview of the Shiny application code. To run the interactive questionnaire, you must execute this code in RStudio or an R environment with Shiny installed. Markdown documents rendered to static HTML cannot run dynamic Shiny applications.

library(shiny)
library(bslib)
library(dplyr)

# ----------------------------
# Config: items & scale labels
# ----------------------------
likert_items <- c(
    "1. My mind goes blank when I try to start a math problem",
    "2. I feel confident solving math problems without panicking",
    "3. I look forward to learning new math concepts",
    "4. I enjoy solving challenging math problems",
    "5. I feel a sense of dread the day before a math class",
    "6. I want to improve my math skills",
    "7. My heart races when I'm called on in math class",
    "8. I remain physically calm during math tests",
    "9. I have trouble sleeping the night before a math exam",
    "10. I avoid taking math classes whenever possible",
    "11. I actively seek math challenges to improve",
    "12. I put off doing my math homework until the last minute",
    "13. I don't see how I'll ever use this math in real life",
    "14. Being good at math is not an important part of who I am",
    "15. Math requires more time and effort than it's worth",
    "16. I believe I can succeed in math with effort",
    "17. Difficult math problems motivate me to learn more",
    "18. I give up easily when math gets difficult",
    "19. I see asking for help as a useful strategy when I'm stuck on a math concept",
    "20. Praise from my teacher on my math work boosts my confidence",
    "21. Watching classmates solve problems quickly makes me doubt my own abilities"
  )

scale_labels <- c(
  "Strongly Disagree",
  "Disagree",
  "Neutral",
  "Agree",
  "Strongly Agree"
)

# ---------------------------------
# One row = item text + 5 radio dots
# ---------------------------------
likert_row <- function(id, label) {
  tags$div(class = "likert-row",
           tags$div(class = "item-col", label),
           tags$div(class = "cell cell-1", tags$input(type = "radio", name = id, value = "1")),
           tags$div(class = "cell cell-2", tags$input(type = "radio", name = id, value = "2")),
           tags$div(class = "cell cell-3", tags$input(type = "radio", name = id, value = "3")),
           tags$div(class = "cell cell-4", tags$input(type = "radio", name = id, value = "4")),
           tags$div(class = "cell cell-5", tags$input(type = "radio", name = id, value = "5"))
  )
}

# ----------------------------
# UI
# ----------------------------
ui <- fluidPage(
  theme = bs_theme(version = 5, bootswatch = "flatly"),

  tags$head(
    tags$title("Comprehensive Math Anxiety Assessment Scale (CMAAS)"),

    # --- Styles ---
    tags$style(HTML("
      .likert-container { max-height: 70vh; overflow: auto; border: 1px solid var(--bs-border-color); border-radius: 0.5rem; }
      .likert-header, .likert-row { display: grid; grid-template-columns: minmax(380px, 3fr) repeat(5, 25px); gap: 2px; align-items: center; }
      .likert-header { position: sticky; top: 0; background: var(--bs-body-bg); padding: 10px 8px; border-bottom: 2px solid var(--bs-border-color); z-index: 5; }
      .likert-row { padding: 10px 8px; border-bottom: 1px solid var(--bs-border-color); }
      .likert-row:nth-child(odd) { background: rgba(0,0,0,0.02); }
      .likert-row:hover { background: rgba(0,0,0,0.045); }

      .item-col { 
        padding-right: 5px;
        font-size: 0.8rem;
        line-height: 1.2;
      }

      .header-instructions strong { font-size: 1rem; }

      .helper-note { color: var(--bs-secondary-color); font-size: 0.8rem; }
      .sticky-actions { position: sticky; bottom: 0; background: var(--bs-body-bg); padding: 10px 0; border-top: 1px solid var(--bs-border-color); z-index: 6; }

      .likert-header .choice-col { writing-mode: vertical-rl; transform: rotate(180deg); text-align: center; white-space: nowrap; font-size: 0.8rem; padding: 0; height: 120px; line-height: 1.1; display: flex; align-items: center; justify-content: center; }

      .cell { display: flex; justify-content: center; align-items: center; }
      .cell-1 { grid-column: 2; }
      .cell-2 { grid-column: 3; }
      .cell-3 { grid-column: 4; }
      .cell-4 { grid-column: 5; }
      .cell-5 { grid-column: 6; }
      .cell input[type=radio] { transform: scale(1.1); cursor: pointer; margin: 0; }
    ")),

    # --- JS: push radio changes into Shiny ---
    tags$script(HTML("
      document.addEventListener('change', function(e){
        var t = e.target;
        if (!t || t.type !== 'radio') return;
        var id = t.name;
        if (window.Shiny) {
          Shiny.setInputValue(id, t.value, {priority: 'event'});
        }
      });
    ")),

    # --- JS: allow server to run arbitrary snippets (used for Reset) ---
    tags$script(HTML("
      Shiny.addCustomMessageHandler('evaljs', function(msg){
        try { eval(msg.code); } catch(e) {}
      });
    "))
  ),

  fluidRow(
    column(12,
           div(
             "Comprehensive Math Anxiety Assessment Scale (CMAAS)",
             style = "
                font-size: 1.2rem; 
                font-weight: bold; 
                text-align: center; 
                margin-bottom: 1rem;
              "
           ),
           p(class = "helper-note", ""),
           div(class = "likert-container",
               tags$div(class = "likert-header",
                        tags$div(
                          class = "item-col header-instructions",
                          tags$strong("Please rate how much you agree with each statement.")
                        ),
                        lapply(scale_labels, function(lbl) tags$div(class = "choice-col", tags$strong(lbl)))
               ),
               uiOutput("likert_rows")
           ),
           div(class = "sticky-actions",
               actionButton("submit", "Submit responses", class = "btn btn-primary me-2"),
               actionButton("reset",  "Reset", class = "btn btn-secondary me-2"),
               downloadButton("download", "Download CSV")
           ),
           hr(),
           h4("Summary (after submit)"),
           tableOutput("summary")
    )
  )
)

# ----------------------------
# Server
# ----------------------------
server <- function(input, output, session) {
  # Render all Likert rows
  output$likert_rows <- renderUI({
    tagList(
      lapply(seq_along(likert_items), function(i) {
        likert_row(id = paste0("Q", i), label = likert_items[i])
      })
    )
  })

  get_val <- function(id) {
    v <- input[[id]]
    if (is.null(v) || identical(v, "")) return(NA_integer_)
    as.integer(v)
  }

  responses <- reactive({
    vals <- vapply(seq_along(likert_items), function(i) get_val(paste0("Q", i)), integer(1))
    data.frame(
      item_id    = sprintf("Q%02d", seq_along(likert_items)),
      item_text  = likert_items,
      response   = vals,
      label      = factor(vals, levels = 1:5, labels = scale_labels),
      stringsAsFactors = FALSE
    )
  })

  # --- Reset: uncheck all radios + clear Shiny inputs ---
  observeEvent(input$reset, {
    lapply(seq_along(likert_items), function(i) {
      # Uncheck all radios for this item and reset Shiny value
      runjs <- "
        document.querySelectorAll('input[name=\\'ID\\']').forEach(function(el){ el.checked = false; });
        if (window.Shiny) { Shiny.setInputValue('ID', null, {priority:'event'}); }
      "
      session$sendCustomMessage(
        type = "evaljs",
        message = list(code = gsub("ID", paste0("Q", i), runjs))  # gsub: replace every 'ID' occurrence
      )
    })
    output$summary <- renderTable(NULL)
  })

  # --- Submit: build summary table (counts by label) ---
  observeEvent(input$submit, {
    df <- responses()
    n_missing <- sum(is.na(df$response))
    if (n_missing > 0) {
      showModal(modalDialog(
        title = "Incomplete",
        sprintf("You still have %d unanswered item%s.", n_missing, ifelse(n_missing == 1, "", "s")),
        easyClose = TRUE, footer = modalButton("OK")
      ))
    }
    out <- df %>%
      count(label, name = "n") %>%
      mutate(label = as.character(label)) %>%
      tidyr::complete(label = scale_labels, fill = list(n = 0)) %>%
      arrange(factor(label, levels = scale_labels))

    output$summary <- renderTable({ out }, striped = TRUE, bordered = TRUE, spacing = "s")
  })

  # --- Download: CSV of individual responses ---
  output$download <- downloadHandler(
    filename = function() paste0("likert_responses_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv"),
    content = function(file) {
      write.csv(responses(), file, row.names = FALSE)
    }
  )
}

shinyApp(ui, server)


1.6 Likert Scale Data Simulation


This R code generates a synthetic dataset of Likert-scale responses (1–5) for 300 respondents across 21 questions representing 7 constructs (Cognitive, Affective, Physiological, Behavioral, Value, Self-Efficacy, and Social/Environmental). The simulation produces realistic survey responses with within-construct correlations while maintaining variability between questions.

# ===========================================================
# 1) SETUP
# ===========================================================
set.seed(42)  # ensures reproducibility for random draws

suppressPackageStartupMessages({
  library(MASS)        # for mvrnorm (multivariate normal generation)
  library(dplyr)       # data wrangling
  library(tidyr)       # reshaping
  library(ggplot2)     # plotting
  library(psych)       # psychometrics functions (not yet used here, but useful)
  library(corrplot)    # correlation visualizations
})

n <- 300  # sample size: number of simulated respondents

# Define item labels across 7 constructs (3 items each = 21 total)
questions <- c("C1", "C2", "C3", # Cognitive
               "A1", "A2", "A3", # Affective
               "P1", "P2", "P3", # Physiological
               "B1", "B2", "B3", # Behavioral
               "V1", "V2", "V3", # Value
               "E1", "E2", "E3", # Self-Efficacy
               "S1", "S2", "S3") # Social

# Construct labels for reference
constructs <- c("Cognitive", "Affective", "Physiological",
                "Behavioral", "Value", "Self_Efficacy", "Social")

# Polarity vector encodes reverse-keyed items:
#   1  = positively keyed (higher = more of trait)
#  -1  = reverse-keyed (must be flipped later)
polarity_vector <- c( C1 =  1, C2 = -1, C3 = -1,   # Cognitive 
                      A1 = -1, A2 =  1, A3 = -1,   # Affective 
                      P1 =  1, P2 = -1, P3 =  1,   # Physiological 
                      B1 =  1, B2 = -1, B3 =  1,   # Behavioral 
                      V1 = -1, V2 =  1, V3 =  1,   # Value 
                      E1 = -1, E2 = -1, E3 =  1,   # Self-Efficacy
                      S1 = -1, S2 = -1, S3 =  1)   # Social 

# Sanity checks: match counts and names
stopifnot(length(questions) == 21L)
stopifnot(identical(sort(names(polarity_vector)), sort(questions)))

# ===========================================================
# 2) LATENT TRAITS
# ===========================================================
# Define 7x7 covariance matrix for latent traits (constructs).
# Stronger correlations within some domains, weaker across others.
Sigma7 <- matrix(c(
  1.0, 0.6, 0.5, 0.4, 0.3, 0.4, 0.2,
  0.6, 1.0, 0.4, 0.5, 0.2, 0.3, 0.3,
  0.5, 0.4, 1.0, 0.3, 0.2, 0.2, 0.1,
  0.4, 0.5, 0.3, 1.0, 0.4, 0.3, 0.2,
  0.3, 0.2, 0.2, 0.4, 1.0, 0.5, 0.3,
  0.4, 0.3, 0.2, 0.3, 0.5, 1.0, 0.4,
  0.2, 0.3, 0.1, 0.2, 0.3, 0.4, 1.0), nrow = 7, byrow = TRUE)

# Generate latent construct scores for n respondents
base_scores <- mvrnorm(n, mu = rep(0, 7), Sigma = Sigma7)

# Each construct has 3 items (used to map theta -> item)
construct_idx <- rep(1:7, each = 3)

# ===========================================================
# 3) ITEM RESPONSE SIMULATION
# ===========================================================
# Define category thresholds for 5-point Likert (z-scores)
base_cp <- c(-Inf, -1.4, -0.5, 0.5, 1.4, Inf)

# Item loadings (discrimination strength); ~0.65–0.90
item_a <- runif(length(questions), 0.65, 0.90)

# Item error variance = 1 - a^2, ensures standardized
item_err_sd <- sqrt(pmax(1e-6, 1 - item_a^2))

# Small random shifts to thresholds across items
th_shift <- rnorm(length(questions), mean = 0, sd = 0.10)

# Simulate ordinal responses
likert_data <- sapply(seq_along(questions), function(j) {
  theta <- base_scores[, construct_idx[j]]          # latent trait
  z <- item_a[j] * theta + rnorm(n, sd = item_err_sd[j]) # add measurement error
  if (polarity_vector[questions[j]] == -1) z <- -z       # flip reverse-keyed items
  cp <- base_cp
  cp[2:5] <- sort(base_cp[2:5] + th_shift[j])            # apply shifted thresholds
  as.integer(cut(z, breaks = cp, labels = FALSE))        # discretize to 1–5
})

# ===========================================================
# 4) RAW & REVERSED DATA
# ===========================================================
colnames(likert_data) <- questions
raw_data <- as.data.frame(likert_data)   # original item scores

final_data <- raw_data
items_to_reverse <- names(polarity_vector)[polarity_vector == -1]

# Reverse-scoring: 6 - response flips 1<->5, 2<->4, 3 stays same
final_data[items_to_reverse] <- 6 - final_data[items_to_reverse]

# Save clean dataset
write.csv(final_data, "final_data.csv", row.names = FALSE)

# ===========================================================
# 5) DISTRIBUTION PLOT
# ===========================================================
# Collapse across all items to show global response distribution
response_proportions <- final_data |>
  pivot_longer(everything(), names_to = "Question", values_to = "Response") |>
  count(Response, name = "Count") |>
  mutate(
    Percentage = 100 * Count / sum(Count),
    Response_Label = factor(
      Response, levels = 1:5,
      labels = c("Strongly\nDisagree", "Disagree", "Neutral", "Agree", "Strongly\nAgree")
    )
  )

print(response_proportions)
## # A tibble: 5 × 4
##   Response Count Percentage Response_Label      
##      <dbl> <int>      <dbl> <fct>               
## 1        1   565       8.97 "Strongly\nDisagree"
## 2        2  1474      23.4  "Disagree"          
## 3        3  2382      37.8  "Neutral"           
## 4        4  1374      21.8  "Agree"             
## 5        5   505       8.02 "Strongly\nAgree"
# Create Likert-style distribution plot
likert_plot <- ggplot(response_proportions,
                      aes(x = Response_Label, y = Percentage, fill = Response_Label)) +
  geom_col(width = 0.7, alpha = 0.9, show.legend = FALSE) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")),
            vjust = -0.8, size = 4, fontface = "bold", color = "black") +
  geom_text(aes(label = paste0("n=", Count)),
            vjust = 2.2, size = 3, color = "white", fontface = "bold") +
  scale_fill_manual(values = c(
    "#c44e52",  # muted red
    "#dd8452",  # muted orange
    "#ccb974",  # muted yellow
    "#64a6bd",  # muted teal
    "#4c72b0"   # muted blue
  )) +
  labs(
    title = "Distribution of Likert Scale Responses",
    subtitle = paste("Total responses:", sum(response_proportions$Count),
                     "| Sample size:", nrow(final_data)),
    x = "Response Category", y = "Percentage of Responses"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(color = "gray40", hjust = 0.5),
    axis.title = element_text(face = "bold"),
    axis.text.x = element_text(size = 12),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank()
  ) +
  ylim(0, max(response_proportions$Percentage) * 1.15) +
  geom_hline(yintercept = seq(0, 40, by = 10),
             linetype = "dotted", alpha = 0.3, color = "gray50")

print(likert_plot)

# ===========================================================
# 6) SUMMARY STATS
# ===========================================================
cat("\nSummary Statistics:\n")
## 
## Summary Statistics:
cat("-------------------\n")
## -------------------
cat("Mean response:", round(mean(as.matrix(final_data)), 2), "\n")
## Mean response: 2.97
cat("Response SD:", round(sd(as.matrix(final_data)), 2), "\n")
## Response SD: 1.06

1.7 Reliability Analysis

Reliability refers to the consistency with which test items measure the intended construct. In practice, this means that individuals with the same underlying trait level should obtain similar scores regardless of when the test is administered, which specific items are used, or minor situational factors. A reliable instrument minimizes the influence of random error, ensuring that observed scores reflect the true construct as closely as possible. Without sufficient reliability, even well-designed measures cannot yield valid or interpretable results, since inconsistency introduces noise that obscures meaningful differences between individuals.

  • Reliability Coefficient
    • Expresses measurement precision on a scale from 0 to 1
    • Higher values indicate greater stability and internal agreement
  • Estimation Approaches
    • Test–Retest: Re-administering the same test to the same group
    • Parallel/Alternate Forms: Comparing equivalent test versions
    • Internal Consistency: Assessing inter-item correlations (e.g., Cronbach’s α)

Interpretation: A coefficient of ≥ 0.70 is generally acceptable for research applications, while ≥ 0.90 is preferred in high-stakes contexts.
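
Each estimation approach reduces to a correlation (or, for internal consistency, a function of item and total-score variances). The sketch below illustrates the first two approaches on hypothetical total-score vectors; all object names are invented for demonstration.

# Hypothetical total scores for the same five respondents.
time1  <- c(20, 25, 17, 30, 22)   # first administration
time2  <- c(21, 24, 18, 29, 23)   # same test, later occasion
form_b <- c(19, 26, 16, 31, 21)   # parallel form taken alongside form A

cor(time1, time2)    # test-retest reliability estimate
cor(time1, form_b)   # parallel-forms reliability estimate
# Internal consistency operates at the item level; see the custom
# Cronbach's alpha implementation in Section 1.7.1.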


1.7.1 Cronbach's \(\alpha\)

Cronbach's alpha (Cronbach, 1951) is a widely-used measure of internal consistency reliability that estimates the degree to which test items measure the same underlying construct.

Reliability Thresholds for Cronbach's \(\alpha\)

  • Excellent (\(\alpha \geq 0.90\)): suitable for high-stakes decisions
  • Good (\(0.80 \leq \alpha < 0.90\)): appropriate for research applications
  • Acceptable (\(0.70 \leq \alpha < 0.80\)): marginal for group comparisons
  • Questionable (\(\alpha < 0.70\)): requires scale revision

Notes:

  • Based on conventional standards from Nunnally (1978) and Kline (1999)
  • Higher thresholds are recommended for clinical applications
  • Values may vary by research domain and instrument purpose

Calculation: \[ \alpha = \frac{k}{k-1} \times\left(1 - \frac{\sum_{i=1}^k \sigma^2_{Y_i}}{\sigma^2_X}\right) \] where:

\(\quad k\) = number of items

\(\quad \sigma^2_{Y_i}\) = variance of item \(i\)

\(\quad \sigma^2_X\) = total test variance

Interpretation:

  • Provides a lower-bound estimate of reliability
  • Assumes essential tau-equivalence of items
  • Sensitive to both:
    • Inter-item correlations
    • Number of test items
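
A quick worked example with hypothetical numbers: suppose \(k = 3\) items have variances 1.0, 1.2, and 0.8, and the total score has variance 7.5. Then

\[ \alpha = \frac{3}{2}\left(1 - \frac{1.0 + 1.2 + 0.8}{7.5}\right) = \frac{3}{2}(1 - 0.40) = 0.90 \]

The total variance (7.5) exceeds the sum of the item variances (3.0) only because the items covary positively; that shared covariance is precisely what drives \(\alpha\) upward.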

R Implementation of Cronbach's alpha:

#' Cronbach's Alpha Reliability Coefficient
#'
#' Computes Cronbach's alpha (\eqn{\alpha}) for a set of items.
#' This implementation includes an internal function for variance
#' calculation rather than using R's built-in \code{var()}.
#'
#' The formula implemented is:
#' \deqn{\alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^k \sigma^2_{Y_i}}{\sigma^2_X} \right)}
#' where \eqn{k} is the number of items, \eqn{\sigma^2_{Y_i}} are the item variances,
#' and \eqn{\sigma^2_X} is the variance of the total score.
#'
#' @param X A matrix or data frame of item responses (rows = respondents, columns = items).
#'
#' @return A single numeric value: Cronbach's alpha reliability coefficient.
#'         Returns \code{NA} if the total variance is undefined or near zero.
#'
#' @details
#' The internal variance function uses the unbiased sample variance
#' definition with denominator \eqn{n-1}. Missing values are removed if
#' \code{na.rm = TRUE}.
#'
#' @references
#' Cronbach, L. J. (1951).
#' Coefficient alpha and the internal structure of tests.
#' \emph{Psychometrika, 16}(3), 297–334.
#'
#' @examples
#' # Simulated Likert-type data: 100 respondents, 5 items
#' set.seed(123)
#' responses <- matrix(sample(1:5, 500, replace = TRUE), ncol = 5)
#' cronbachs_alpha(responses)
#'
#' @export
cronbachs_alpha <- function(X){
  # Internal variance function (sample variance, unbiased)
  my_var <- function(x, na.rm = FALSE){
    if (na.rm) x <- x[!is.na(x)]
    n <- length(x)
    if (n <= 1) return(NA_real_)
    m <- mean(x)
    sum((x - m)^2) / (n - 1)
  }
  
  # Convert input to numeric matrix
  X <- data.matrix(X)
  
  # Number of items
  k <- ncol(X)
  if (k < 2) stop("Need ≥2 items")
  
  # Item variances
  item_vars <- apply(X, 2, my_var, na.rm = TRUE)
  
  # Total variance of summed score
  total_scores <- rowSums(X, na.rm = TRUE)
  total_var <- my_var(total_scores, na.rm = TRUE)
  
  # Return NA if total variance invalid
  if (!is.finite(total_var) || total_var <= .Machine$double.eps) 
    return(NA_real_)
  
  # Cronbach's alpha
  (k/(k-1)) * (1 - sum(item_vars, na.rm = TRUE) / total_var)
}

# Save to a file
dump("cronbachs_alpha", file = "cronbachs_alpha.R")

This code calculates the item–total correlations for each column in raw_data and then visualizes them with a vertical diverging bar chart. For each item, it correlates its scores with the mean of all other items, creating a measure of how well that item aligns with the rest of the scale.

# ===========================================================
#   Diverging Bar Chart: Vertical, Dynamic Range (Data-Driven)
# ===========================================================

library(ggplot2)
library(dplyr)

# ---- Item–total correlations ----
item_cors <- sapply(seq_len(ncol(raw_data)), function(i) {
  cor(
    raw_data[, i],
    rowMeans(raw_data[, -i, drop = FALSE], na.rm = TRUE),
    use = "pairwise.complete.obs"
  )
})

item_summary <- data.frame(
  Item    = colnames(raw_data),
  r_total = as.numeric(item_cors)
) %>%
  mutate(Item = reorder(Item, r_total))

# ---- Dynamic limits from data (with proportional padding) ----
r_obs_min <- min(item_summary$r_total, na.rm = TRUE)
r_obs_max <- max(item_summary$r_total, na.rm = TRUE)
r_span    <- max(1e-6, r_obs_max - r_obs_min)      # avoid zero-span
pad       <- 0.05 * r_span                         # 5% padding
ymin      <- max(-1, r_obs_min - pad)
ymax      <- min( 1, r_obs_max + pad)

# thresholds to show only if they fall within the range
thr_vals   <- c(-0.5, -0.3, 0, 0.3, 0.5)
thr_inside <- thr_vals[thr_vals >= ymin & thr_vals <= ymax]
thr_types  <- setNames(
  c("solid","dashed","solid","dashed","solid"),
  thr_vals
)
thr_cols   <- setNames(
  c("grey40","grey50","black","grey50","grey40"),
  thr_vals
)

# ---- Plot ----
ggplot(item_summary, aes(x = Item, y = r_total, fill = r_total)) +
  geom_col(width = 0.7) +
  # reference lines that are inside the plotting range
  lapply(thr_inside, function(ti)
    geom_hline(yintercept = ti,
               linetype = thr_types[as.character(ti)],
               color    = thr_cols[as.character(ti)],
               linewidth = if (ti == 0) 0.6 else 0.4)
  ) +
  # diverging palette with non-white midpoint at 0
  scale_fill_gradient2(
    low      = "#08306b",   # dark blue (neg)
    mid      = "grey70",    # visible neutral at 0
    high     = "#99000d",   # dark red (pos)
    midpoint = 0,
    limits   = c(ymin, ymax),     # match color limits to dynamic range
    oob      = scales::squish,
    name     = "Correlation (r)"
  ) +
  scale_y_continuous(limits = c(ymin, ymax)) +
  labs(
    title    = "Item–Total Correlations",
    subtitle = "",
    x        = "Survey Items",
    y        = "Correlation (r)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position    = "top",
    axis.text.x        = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank()
  )


Item-Total Correlation Plot Interpretation Guide

This plot shows how strongly each survey item correlates with the total score of the other items (corrected item–total correlation).

  • Bars represent the correlation coefficient (r) for each item.
  • Colors encode the size and direction:
    • Dark red = strong positive correlation (good fit with the construct).
    • Grey = near zero (weak relationship).
    • Dark blue = negative correlation (potential problem).
  • Horizontal reference lines:
    • 0.00 (black, solid) = no correlation.
    • ±0.30 (grey, dashed) = commonly used threshold for "acceptable".
    • ±0.50 (grey, solid) = indicates strong alignment.

How to interpret values:

  • r ≥ 0.50 → Strong contributor; item aligns very well with the scale.
  • 0.30 ≤ r < 0.50 → Acceptable to good; typically retained.
  • 0.15 ≤ r < 0.30 → Marginal; review wording and theoretical importance.
  • 0 ≤ r < 0.15 → Weak; consider revision or removal.
  • r < 0 → Red flag; check for mis-keyed or reverse-coded items, ambiguous wording, or items tapping a different construct.
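
This rubric can be encoded directly. The following is a hypothetical helper (classify_item_total is not from any package) that maps corrected item–total correlations onto the categories above.

# Hypothetical helper implementing the rubric above.
classify_item_total <- function(r) {
  cut(r,
      breaks = c(-Inf, 0, 0.15, 0.30, 0.50, Inf),
      labels = c("Red flag", "Weak", "Marginal",
                 "Acceptable to good", "Strong contributor"),
      right = FALSE)   # intervals are [lower, upper)
}

classify_item_total(c(-0.35, 0.10, 0.22, 0.45, 0.61))
# -> Red flag, Weak, Marginal, Acceptable to good, Strong contributor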

Important: If reverse-keyed items are not corrected first, they will appear as negative even if they are valid. Always use the reverse-scored dataset (final_data) when computing these correlations.

In practice, focus on both the magnitude of the correlation and the theoretical role of the item. Retain or revise items not only on statistics but also on content validity and coverage of the construct.


Technical Notes

  • Correlations are computed using pairwise complete observations.
  • The “total score” is defined as the mean of all other items (leave-one-out).
  • Items are ordered by correlation strength for easier comparison.
  • The diverging red–blue gradient is chosen for clarity and accessibility, with grey marking neutral/no association.

Interpretation Guide: Item–Total Correlation Plot

This item–total correlation plot reveals measurement issues resulting from improper reverse coding. The negative correlations across many items indicate misalignment between those items and the total scale score.

  • The lowest observed correlations drop to approximately –0.35, which is well below the expected range for properly functioning items.
  • In a well-constructed scale, item–total correlations should generally be positive, typically between 0.30 and 0.70, showing that each item contributes meaningfully to measuring the underlying construct.

Here, the negative correlations suggest that:
  • Reverse-coded items were not properly handled before analysis
  • These items are oriented in the opposite direction of the construct
  • Respondents who scored high on the total scale tended to score low on these items

This pattern is characteristic of a scale where negative polarity items were not reverse-scored prior to computing totals and correlations. This creates artificial negative relationships between the items and the overall scale.

In practical terms:
- These results would artificially lower Cronbach’s alpha
- They would distort the psychometric profile of the instrument

The solution:
Reverse-score all negatively worded items (e.g., for 5-point Likert: 6 – raw_score) before computing total scores and item–total correlations. After proper reverse coding, correlations should flip positive and fall within the expected range.
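A minimal sketch of this correction, assuming the item matrix raw_data and the named polarity_vector (+1 = keyed as-is, −1 = reverse-keyed) used elsewhere in this document:

# Reverse-score negatively keyed items on a 1-5 Likert scale.
# Assumes raw_data (persons x items) and a named polarity_vector
# (+1 = positive keying, -1 = reverse keying), as used elsewhere here.
reverse_score <- function(raw_data, polarity_vector, max_pt = 5) {
  out <- as.data.frame(raw_data)
  rev_items <- names(polarity_vector)[polarity_vector == -1]
  out[rev_items] <- (max_pt + 1) - out[rev_items]
  out
}
# final_data <- reverse_score(raw_data, polarity_vector)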


Polytomous test data analysis involves evaluating assessment items with multiple response categories (e.g., Likert scales, partial credit). This analysis examines reliability, item discrimination, and scale properties.

# ===========================================================
# 5. RELIABILITY ANALYSIS
# ===========================================================
set.seed(42)

# --- helpers ------------------------------------------------
corrected_item_total <- function(X) {
  # X: data.frame or matrix (rows = persons, cols = items)
  X <- as.data.frame(X)
  sapply(seq_len(ncol(X)), function(j) {
    total_minus_j <- rowSums(X[ , -j, drop = FALSE])
    suppressWarnings(cor(X[[j]], total_minus_j, use = "pairwise.complete.obs"))
  })
}

alpha_if_deleted <- function(X, alpha_fun = cronbachs_alpha) {
  X <- as.data.frame(X)
  sapply(seq_len(ncol(X)), function(j) alpha_fun(X[ , -j, drop = FALSE]))
}
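
# cronbachs_alpha() is assumed to be defined earlier in the document
# (custom routines are preferred over black-box functions here).
# If it is not in scope, a minimal version of the standard formula
# (Cronbach, 1951) is:
if (!exists("cronbachs_alpha")) {
  cronbachs_alpha <- function(X) {
    X <- as.data.frame(X)
    k <- ncol(X)
    (k / (k - 1)) *
      (1 - sum(sapply(X, var, na.rm = TRUE)) / var(rowSums(X), na.rm = TRUE))
  }
}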

# --- 1) Scale-level alphas ---------------------------------
alpha_raw      <- cronbachs_alpha(raw_data)
alpha_reversed <- cronbachs_alpha(final_data)

# Create results with interpretation
alpha_results <- data.frame(
  Data_Type = c("Raw Data", "Reverse-Coded Data"),
  Cronbachs_Alpha = c(alpha_raw, alpha_reversed),
  Interpretation = c("Invalid Measurement", "Excellent Reliability")
)

knitr::kable(
  alpha_results,
  caption   = "Reliability Analysis - Cronbach's Alpha Values",
  col.names = c("Data Type", "Cronbach's α", "Interpretation"),
  align     = c("l", "c", "l"),
  digits    = 3
) |>
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Reliability Analysis - Cronbach's Alpha Values

  Data Type             Cronbach's α   Interpretation
  Raw Data                    -0.795   Invalid Measurement
  Reverse-Coded Data           0.855   Excellent Reliability
# --- 2) Item-level diagnostics on the correctly reversed data -----------
X <- final_data[ , questions]  # ensure intended order

item_means  <- colMeans(X, na.rm = TRUE)
item_sds    <- apply(X, 2, sd, na.rm = TRUE)
item_rIT    <- corrected_item_total(X)
alpha_drop  <- alpha_if_deleted(X, alpha_fun = cronbachs_alpha)

item_stats <- data.frame(
  Item          = questions,
  Mean          = round(item_means, 2),
  SD            = round(item_sds, 2),
  Item_Total_r  = round(item_rIT, 3),
  Alpha_if_Del  = round(alpha_drop, 3),
  Reverse_Coded = ifelse(polarity_vector[questions] == -1, "Yes", "No"),
  row.names     = NULL,
  check.names   = FALSE
)

# Present table and highlight weaker items (e.g., r < .35)
thr <- 0.350
knitr::kable(
  item_stats,
  caption = "Item-Level Statistics After Proper Reverse Coding",
  align   = c("c","c","c","c","c","c")
) |>
  kableExtra::kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  kableExtra::row_spec(which(item_stats$Item_Total_r < thr),
                       bold = TRUE, color = "black", background = "#FFDDDD")
Item-Level Statistics After Proper Reverse Coding
Item Mean SD Item_Total_r Alpha_if_Del Reverse_Coded
C1 3.21 0.99 0.442 0.848 No
C2 2.90 1.04 0.534 0.845 Yes
C3 3.16 1.02 0.491 0.846 Yes
A1 2.94 1.10 0.549 0.844 Yes
A2 2.99 1.12 0.475 0.847 No
A3 2.81 1.04 0.453 0.848 Yes
P1 2.86 1.03 0.300 0.853 No
P2 2.95 1.10 0.396 0.850 Yes
P3 2.87 1.00 0.352 0.851 No
B1 3.15 1.05 0.522 0.845 No
B2 2.90 1.07 0.381 0.851 Yes
B3 3.04 1.06 0.484 0.847 No
V1 2.84 1.10 0.507 0.846 Yes
V2 2.86 1.07 0.458 0.848 No
V3 2.95 1.02 0.448 0.848 No
E1 3.02 1.06 0.415 0.849 Yes
E2 2.94 1.10 0.416 0.849 Yes
E3 2.92 1.03 0.479 0.847 No
S1 2.93 1.14 0.378 0.851 Yes
S2 3.10 1.08 0.294 0.854 Yes
S3 2.95 1.01 0.291 0.854 No

Interpretation

  • Raw Data (α = –0.795): A severely negative reliability coefficient caused by uncorrected reverse-coded items. Alpha drops below zero when the average inter-item covariance is negative, which is exactly what happens when many items are keyed in the opposite direction; the scale cannot be interpreted in its raw form.
  • Reverse-Coded Data (α = 0.855): After reverse coding, the scale shows strong reliability, well above the conventional 0.70 cutoff for research use and exceeding 0.80, which indicates high internal consistency.

This dramatic improvement illustrates the critical importance of reverse-coding procedures in scale development and scoring.


  • Central Tendency & Variability
    • Item means cluster around 3.0 (midpoint of the Likert scale).
    • Standard deviations range from 0.99 to 1.14, showing good response variability without ceiling or floor effects.
  • Item–Total Correlations
    • Strongest items: A1 (0.549), C2 (0.534), B1 (0.522).
    • Moderate contributors: Most items fall between 0.40 – 0.50, reflecting solid alignment with the total scale.
    • Weaker items: S2 (0.294), S3 (0.291), and P1 (0.300) approach the lower bound of acceptability. Although they do not harm the scale, they may be flagged for further review.
    • Interpretive note: In classical test theory, item–total r values above 0.30 are typically considered acceptable, especially when overall scale reliability is strong.
  • Alpha-if-Deleted
    • All values lie between 0.844 – 0.854, essentially the same as the overall α = 0.855.
    • No item deletion would meaningfully improve reliability, demonstrating that the set of items is internally consistent.

Conclusion: After proper reverse coding, the Comprehensive Math Anxiety Assessment Scale (CMAAS) demonstrates excellent internal consistency (α = 0.855). Reliability is stable across items, and no item removal would enhance the overall scale. A few items (S2, S3, P1) show comparatively weaker correlations with the total score; S2 and S3 fall just below the conventional 0.30 guideline, but none degrades reliability (no alpha-if-deleted value exceeds the overall 0.855). These findings support retaining the full item set for both research and applied purposes, with those three items flagged for review in future revisions.


This code computes a polychoric correlation matrix on final_data with psych::polychoric() (appropriate for ordinal Likert items), then draws an upper-triangle heatmap with corrplot, coloring each coefficient label white or black depending on the luminance of its tile.

# ===========================================================
# POLYCHORIC CORRELATION HEATMAP (adaptive text color)
# ===========================================================
cor_matrix <- psych::polychoric(final_data)$rho
cor_matrix_nodiag <- cor_matrix
diag(cor_matrix_nodiag) <- NA

# Palette: dark red (negative) -> near-white (zero) -> dark blue (positive)
pal <- colorRampPalette(c("#B2182B", "#F7F7F7", "#2166AC"))(201)

# Map correlation values to colors
brks <- seq(-1, 1, length.out = length(pal))
idx  <- findInterval(cor_matrix, brks, all.inside = TRUE)
fill_cols <- matrix(pal[idx], nrow = nrow(cor_matrix))

# Luminance function
hex_to_luma <- function(hex) {
  rgb <- col2rgb(hex) / 255
  gamma <- function(u) ifelse(u <= 0.04045, u/12.92, ((u+0.055)/1.055)^2.4)
  lin <- apply(rgb, 1, gamma)
  0.2126*lin[,1] + 0.7152*lin[,2] + 0.0722*lin[,3]
}

luma <- matrix(hex_to_luma(as.vector(fill_cols)), nrow = nrow(cor_matrix))

# Contrast-aware text colors
coef_colors <- ifelse(luma < 0.5, "white", "black")

# Keep only upper triangle (and drop diagonal)
coef_colors[lower.tri(coef_colors, diag = TRUE)] <- NA
coef_colors_vec <- coef_colors[upper.tri(coef_colors, diag = FALSE)]

# Plot
par(mar = c(0, 0, 0, 0))
corrplot(
  cor_matrix_nodiag,
  method = "color",
  type   = "upper",
  col    = pal,
  tl.col = "black",
  tl.srt = 45,
  tl.cex = 0.8,
  diag   = FALSE,
  title  = "ITEM POLYCHORIC CORRELATION MATRIX",
  mar    = c(0, 0, 2, 0),
  addCoef.col = coef_colors_vec,   # <- correct vector, not full matrix
  number.cex  = 0.9,
  cl.ratio    = 0.2,
  cl.align.text = "l"
)


Correlation Heatmap Interpretation Guide

This heatmap visualizes pairwise correlations between all survey items, helping identify:

  • Strongly related item clusters
  • Potential redundancy between items
  • Unexpected negative correlations
  • Overall scale structure

Key Components

Color Scale

  • Blue: Positive correlations (darker = stronger)
  • Red: Negative correlations (darker = stronger)
  • White: Near-zero correlations

Matrix Structure

  • Upper triangle: Shows polychoric correlation values without duplication
    • A polychoric correlation estimates the correlation between two latent continuous variables when only their ordinal categories (such as Likert items) are observed (formalized just below this list).
  • Diagonal: Intentionally blank (items always perfectly correlate with themselves)
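
For reference, the latent-response model behind a polychoric correlation, in standard notation (not specific to this code): each ordinal response is assumed to arise from an underlying normal variable cut at thresholds,

\[
X = k \;\Longleftrightarrow\; \tau_{k-1} < X^{*} \le \tau_{k},
\qquad
(X^{*}, Y^{*}) \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right),
\]

and the polychoric correlation is the \(\rho\) of the latent bivariate normal, estimated from the observed contingency table of the two items.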

Correlation Strength

Color Intensity   Approximate Range   Interpretation
Dark Blue         0.7 to 1.0          Very strong positive relationship
Medium Blue       0.4 to 0.7          Moderate positive relationship
Light Blue        0.1 to 0.4          Weak positive relationship
White             -0.1 to 0.1         Negligible/no relationship
Light Red         -0.1 to -0.4        Weak negative relationship
Medium Red        -0.4 to -0.7        Moderate negative relationship
Dark Red          -0.7 to -1.0        Very strong negative relationship

Analytical Insights

  • Item Clusters:
    • Groups of dark blue squares indicate items measuring similar constructs
    • Useful for identifying potential subscales
  • Redundant Items:
    • Very dark blue squares (r > 0.8) may indicate item redundancy
    • Consider removing or revising one of the items
  • Problematic Correlations:
    • Unexpected dark red squares may indicate:
      • Reverse-coding issues
      • Items measuring opposing constructs
      • Potential data errors
  • Isolated Items:
    • Items with mostly white squares aren't correlating with others
    • May need revision or removal from scale

Recommendations

  • Strong Positives (Dark Blue):
    • Check if extremely high correlations (>0.8) suggest redundancy
    • Confirm these items aren't overly similar in wording
  • Strong Negatives (Dark Red):
    • Verify if negative correlations are theoretically expected
    • Check reverse-coding was properly applied
  • Weak/No Correlations (White/Light Colors):
    • Items may not belong in the scale
    • May need rewording or removal
  • Pattern Analysis:
    • Look for blocks of similar colors indicating construct clusters
    • Check if expected relationships appear (e.g., all positively correlated)

Technical Notes

  • Uses polychoric correlation coefficients, appropriate for ordinal Likert items
  • Each pairwise coefficient is estimated from the two items’ contingency table, so incomplete cases are effectively handled pairwise
  • Color intensity scales with absolute correlation strength
  • Items appear in their original order (no clustering-based reordering was requested in corrplot)

This item correlation matrix provides a detailed look at the interrelationships among survey items after proper reverse coding. The values represent pairwise polychoric correlations, with darker shading indicating stronger associations in either direction.


Interpretation of the Item Correlation Heatmap

The heatmap of item–item polychoric correlations provides a clear picture of the structure of the scale after proper reverse coding. Strong within-domain cohesion is visible, alongside generally low cross-domain overlap, which supports the intended multidimensional design of the instrument.


Strong Within-Domain Cohesion

Items within the same construct cluster together, with most within-domain correlations falling between roughly .25 and .80 depending on the domain. This indicates solid internal consistency:

  • Cognitive (C1–C3): Moderate to strong (0.40 – 0.62).
  • Affective (A1–A3): Very strong, A1–A2 = 0.70, while A1–A3 = 0.37 shows a weaker but still positive tie.
  • Physiological (P1–P3): More moderate (0.25 – 0.45).
  • Behavioral (B1–B3): Very strong, B1–B2 = 0.64, B1–B3 = 0.79, with B2–B3 lower at 0.30.
  • Value (V1–V3): High, V1–V2 = 0.68, with V1–V3 and V2–V3 more modest (≈0.29–0.31).
  • Self-Efficacy (E1–E3): Moderate, ranging 0.37 – 0.55.
  • Social (S1–S3): S1–S2 = 0.64, while S3 shows weaker links (≈0.22).

The darker shading within these clusters confirms that each subscale is capturing a coherent latent trait.


Minimal Cross-Domain Correlations

Across different constructs, most correlations hover between 0.00 – 0.35, consistent with good discriminant validity. Constructs are related but not redundant:

  • Moderate cross-links (expected overlap):
    • C1–A1 = 0.33 (Cognitive ↔︎ Affective)
    • C1–E1 = 0.29 (Cognitive ↔︎ Self-Efficacy)
  • Weak or negligible links:
    • P1–S1 = 0.00
    • P2–S1 = –0.02
    • P3–S2 = –0.01
    • P3–S3 = –0.04

Physiological and Social items, in particular, remain distinct, showing limited overlap with other domains.


Psychometric Implications

  • Internal Reliability: Within-domain clustering is consistent with item–total analyses, confirming subscale coherence.
  • Discriminant Validity: Minimal cross-domain overlap reinforces that the constructs represent distinct dimensions of math anxiety.
  • Model Fit Expectations: A confirmatory factor analysis (CFA) would likely reveal seven correlated but separable factors, with math anxiety as an overarching higher-order construct.

In summary: The correlation heatmap confirms a robust multidimensional structure. Each subscale demonstrates strong internal cohesion while maintaining clear boundaries from other domains, providing evidence for both the reliability and validity of the CMAAS.


# ===========================================================
# Response Distributions (after reverse coding)
# ===========================================================

library(dplyr)
library(tidyr)
library(ggplot2)

# Long format (no temporary .row column)
plot_data <- final_data |>
  tidyr::pivot_longer(cols = tidyselect::everything(),
                      names_to = "Item", values_to = "Response") |>
  dplyr::mutate(
    Item = factor(Item, levels = colnames(final_data)),
    Domain = dplyr::case_when(
      grepl("^C", Item) ~ "Cognitive",
      grepl("^A", Item) ~ "Affective",
      grepl("^P", Item) ~ "Physiological",
      grepl("^B", Item) ~ "Behavioral",
      grepl("^V", Item) ~ "Value",
      grepl("^E", Item) ~ "Self-Efficacy",
      grepl("^S", Item) ~ "Social",
      TRUE ~ "Other"
    ),
    Response = factor(
      Response,
      levels = 1:5,
      labels = c("Strongly\nDisagree", "Disagree", "Neutral", "Agree", "Strongly\nAgree"),
      ordered = TRUE
    )
  )


# Per-item counts & percents
counts <- plot_data |>
  count(Item, Domain, Response, name = "n") |>
  group_by(Item) |>
  mutate(pct = 100 * n / sum(n)) |>
  ungroup()

# Plot
ggplot(counts, aes(x = Response, y = n, fill = Domain)) +
  geom_col(width = 0.75, alpha = 0.9, color = "white", linewidth = 0.2) +
  geom_text(aes(label = sprintf("%.1f%%", pct)), vjust = -0.6, size = 3) +
  facet_wrap(~ Item, ncol = 3, scales = "free_y") +
  labs(
    title = "Response Distributions After Reverse Coding",
    subtitle = paste("N =", nrow(final_data), "cases"),
    x = NULL, y = "Count",
    caption = "Note: Labels show within-item percentages; bars show counts. Items reverse-coded where required."
  ) +
  scale_fill_manual(values = c(
    "Cognitive"      = "#4C72B0",
    "Affective"      = "#C44E52",
    "Physiological"  = "#55A868",
    "Behavioral"     = "#8172B2",
    "Value"          = "#DD8452",
    "Self-Efficacy"  = "#64A6BD",
    "Social"         = "#CCB974",
    "Other"          = "#937860"
  )) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 8),
    panel.grid.major.x = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(color = "gray40")
  )


Response Distributions Plot Interpretation Guide

This visualization shows the frequency distribution of responses for each survey item after reverse coding, organized by theoretical domain.

Plot Elements

  • Bars: Show response frequencies (counts) for each item
    • Color-coded by domain (Cognitive, Affective, etc.)
    • Within-item percentages displayed above each bar
  • X-axis: 5-point Likert scale
    • [1] = Strongly Disagree
    • [2] = Disagree
    • [3] = Neutral
    • [4] = Agree
    • [5] = Strongly Agree
  • Facets: Each panel represents one survey item
    • Items grouped by domain (color)
    • Original item order preserved

Healthy Response Patterns

Pattern        Characteristics                  Interpretation
Balanced       Roughly symmetric distribution   Good discrimination
Slight skew    Moderate leaning one direction   Expected for attitudinal items
Clear trends   Gradual increase/decrease        Valid response pattern

Problematic Patterns

Pattern          Warning Signs                  Potential Issues
Floor effect     >50% in "Strongly Disagree"    Item too difficult/negative
Ceiling effect   >50% in "Strongly Agree"       Item too easy/positive
Polarized        Peaks at both ends             Ambiguous wording
Flat             Even distribution across all   Item not discriminating
Missing middle   Avoids neutral responses       Forced choice effect

Domain-Specific Insights

  • Within-Domain Consistency:
    • Similar items should show similar distributions
    • Marked differences may indicate wording issues
  • Between-Domain Comparisons:
    • Cognitive items often show more variance
    • Affective items may skew positive
    • Behavioral items should align with actual behaviors

Actionable Recommendations:

  • Items Needing Review:
    • Those with >70% in one extreme
    • Items with unusual bimodal distributions
    • Panels with very low response counts
  • Response Style Checks:
    • Look for consistent patterns across all items
    • Check for acquiescence bias (all positive)
    • Identify neutral response avoidance
  • Reverse-Coding Verification:
    • Confirm reversed items show inverse patterns
    • Check for residual negative wording effects

Technical Notes:

  • N = Total number of respondents (shown in subtitle)
  • Reverse coding already applied
  • Items grouped by theoretical domain

Response Distributions After Reverse Coding

What the Plot Shows

Overall usage of the scale:

  • Responses are well distributed across categories; extremes are rare (most items have <10% in “Strongly Disagree/Agree”).
  • A neutral-center tendency is prominent: many items peak around Neutral ≈ 36 – 42%.
  • Examples: C1 38.0%, C2 38.7%, C3 36.3%, B2 37.0%, B3 41.7%, E2 36.0%, E3 41.7%, S1 39.3%, S2 37.7%, S3 42.7%.


Domain-specific patterns

  • Cognitive (C1–C3): Middle-peaked with Neutral ≈ 36 – 39%, secondary mass in Agree (~20 – 29%).
  • Affective (A1–A3): A1/A2 are fairly balanced, while A3 peaks at Agree ≈ 40%, indicating stronger endorsement of that item.
  • Physiological (P1–P3): Broad spreads with noticeable Neutral peaks; P3 ≈ 42% Neutral, suggesting milder or variable somatic symptoms.
  • Behavioral (B1–B3): Consistent mid-range endorsement; B1 ≈ 40% Neutral, B3 ≈ 41.7% Neutral.
  • Value (V1–V3): Mixed but centered; V2 ≈ 36% Neutral, V3 ≈ 41.7% Neutral; V1 shows a lower Neutral peak (≈33%) with more spread to Disagree/Agree.
  • Self-Efficacy (E1–E3): E1 ≈ 33% Disagree (relative dip in confidence), while E2 ≈ 36% Neutral and E3 ≈ 41.7% Neutral.
  • Social (S1–S3): Predominantly mid-scale; S3 ≈ 42.7% Neutral is the tallest single Neutral bar in the grid.

Psychometric implications

  • No floor/ceiling effects: No item has >50% in an endpoint category; all five response options are used.
  • Adequate variability: Distributions provide information for distinguishing respondents; a neutral-center bias is typical for anxiety self-report items.
  • Reverse coding verified: Patterns are sensible post-recoding; no inverted items are apparent.


Actionable Recommendations

  • Items to watch (not problematic, but notable):
    • Those with very high Neutral (≈ 40% +; e.g., B3, E3, S3)—consider whether wording could encourage more differentiation.
    • E1 shows higher Disagree (~33%) than its siblings, which may signal a tougher or more negatively framed statement.
  • Consistency checks
    • Within each domain, distributions are broadly similar; any outlier shape may merit a quick wording review.
    • Keep monitoring for acquiescence or neutral preference if future samples shift further toward the middle.

Technical Notes

  • N = 300 (shown in subtitle).
  • Reverse coding has been applied wherever required.
  • Items are colored by domain and faceted in original order for easy cross-reference.

Summary: The post–reverse-coding response distributions show balanced category use, no endpoint collapse, and domain-consistent shapes. The neutral-centered patterns are typical for math-anxiety instruments and support the scale’s construct validity while leaving room for item-level refinement where Neutral peaks are highest.


1.8 Factor Structure Analysis

Factor Structure Analysis is a statistical method used to explore the underlying structure of a dataset by identifying latent variables (factors) that explain the correlations among observed variables. It is commonly used in psychometrics, social sciences, market research, and data reduction to simplify complex datasets and uncover hidden patterns.
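
Formally, these methods rest on the common factor model, stated here in standard notation for reference:

\[
\mathbf{x} = \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta},
\]

where \(\mathbf{x}\) is the vector of observed variables, \(\boldsymbol{\Lambda}\) the matrix of factor loadings, \(\mathbf{f}\) the latent factors with correlation matrix \(\boldsymbol{\Phi}\), and \(\boldsymbol{\Theta}\) the diagonal matrix of unique (error) variances.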

Types of Factor Analysis:

  • Exploratory Factor Analysis (EFA)
    • Used when the underlying factor structure is unknown.
    • Helps identify possible latent structures.
  • Confirmatory Factor Analysis (CFA)
    • Tests a predefined factor structure (hypothesis-driven).
    • Part of Structural Equation Modeling (SEM).

1.8.1 Exploratory Factor Analysis (EFA)

Exploratory Factor Analysis (EFA) is a statistical technique used to identify the underlying structure of a large set of variables. It helps uncover latent (hidden) factors that explain patterns of correlations among observed variables—commonly used in psychology, education, and survey research.


1.8.2 Sampling Adequacy – Kaiser-Meyer-Olkin (KMO) and Bartlett’s Test

# LOAD REQUIRED PACKAGES
#===========================================================
required_packages <- c("GPArotation", "psych", "lavaan")
for(pkg in required_packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

# SET ANALYSIS PARAMETERS
#===========================================================
nsize <- nrow(final_data)
numitems <- ncol(final_data)
extract <- 7  # Number of factors to extract
minload <- 0.6  # Minimum loading threshold

# CORRELATION MATRIX
#===========================================================
cormat <- cor(final_data, use = "pairwise.complete.obs")

# SAMPLING ADEQUACY TESTS
#===========================================================
cat("\n=== SAMPLING ADEQUACY ===\n")
## 
## === SAMPLING ADEQUACY ===
kmo_result <- KMO(cormat)
print(kmo_result)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cormat)
## Overall MSA =  0.83
## MSA for each item = 
##   C1   C2   C3   A1   A2   A3   P1   P2   P3   B1   B2   B3   V1   V2   V3   E1 
## 0.87 0.87 0.86 0.85 0.85 0.87 0.77 0.80 0.80 0.80 0.87 0.80 0.86 0.80 0.84 0.82 
##   E2   E3   S1   S2   S3 
## 0.85 0.85 0.80 0.76 0.77
bartlett_test <- cortest.bartlett(cormat, n = nsize)
print(bartlett_test)
## $chisq
## [1] 2405.9
## 
## $p.value
## [1] 0
## 
## $df
## [1] 210

The Kaiser-Meyer-Olkin (KMO) statistic evaluates whether the correlation matrix is suitable for factor analysis by comparing the size of observed correlations to partial correlations. Bartlett’s test of sphericity checks whether the correlation matrix significantly differs from an identity matrix (all zeros off-diagonal).
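
For reference, both statistics have standard closed forms:

\[
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} p_{ij}^{2}},
\qquad
\chi^{2} = -\left(n - 1 - \frac{2m + 5}{6}\right)\ln\left|\mathbf{R}\right|,
\quad
df = \frac{m(m-1)}{2},
\]

where \(r_{ij}\) are observed correlations, \(p_{ij}\) are partial correlations, \(\mathbf{R}\) is the \(m \times m\) correlation matrix, \(n\) is the sample size, and \(m\) is the number of items. KMO approaches 1 when the partial correlations are small relative to the observed correlations.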


Results

  • Overall KMO (MSA) = 0.83
    This falls into the “meritorious” range (Kaiser, 1974). Values ≥ 0.80 indicate that common variance is sufficient to justify factor analytic procedures.

  • Item-Level MSA Values (0.76 – 0.87):

    • Strongest items: C1, C2, A3, B2 (≈ 0.87).
    • Weaker but acceptable items: P1 (0.77), S2 (0.76), S3 (0.77).
    • All items exceed the recommended minimum of 0.60, confirming individual adequacy.
  • Bartlett’s Test of Sphericity

    • χ²(210) = 2405.9, p < .001
    • The null hypothesis of an identity matrix is rejected.
    • This means the correlation matrix contains significant relationships suitable for factor extraction.

Interpretation

  • The overall KMO = 0.83 indicates that the dataset is well-suited for factor analysis, providing confidence in subsequent EFA or CFA.
  • Item-level MSAs show that all items contribute adequately, though a few (P1, S2, S3) are somewhat less integrated and may be weaker contributors.
  • The significant Bartlett’s test confirms that correlations among items are sufficiently large for structure detection.

Conclusion:
The CMAAS dataset demonstrates excellent factorability. With KMO in the meritorious range and Bartlett’s test highly significant, the scale is appropriate for exploratory and confirmatory factor analysis, supporting its use in psychometric validation.


1.8.3 Principal Component Analysis (PCA)

# PRINCIPAL COMPONENT ANALYSIS
#===========================================================
cat("\n=== PRINCIPAL COMPONENT ANALYSIS ===\n")
## 
## === PRINCIPAL COMPONENT ANALYSIS ===
pca_fit <- principal(r = cormat,
                     nfactors = extract,
                     rotate = "varimax",
                     n.obs = nsize)

print(loadings(pca_fit), digits = 2, cutoff = minload)
## 
## Loadings:
##    RC6   RC3   RC4   RC2   RC1   RC7   RC5  
## C1                          0.79            
## C2                          0.78            
## C3                          0.75            
## A1        0.81                              
## A2        0.80                              
## A3        0.79                              
## P1                                      0.81
## P2                                      0.80
## P3                                      0.75
## B1  0.83                                    
## B2  0.82                                    
## B3  0.85                                    
## V1              0.79                        
## V2              0.86                        
## V3              0.82                        
## E1                                0.80      
## E2                                0.78      
## E3                                0.76      
## S1                    0.76                  
## S2                    0.83                  
## S3                    0.86                  
## 
##                 RC6  RC3  RC4  RC2  RC1  RC7  RC5
## SS loadings    2.28 2.27 2.23 2.15 2.05 2.04 2.02
## Proportion Var 0.11 0.11 0.11 0.10 0.10 0.10 0.10
## Cumulative Var 0.11 0.22 0.32 0.43 0.52 0.62 0.72
cat("\nVariance Accounted For:\n")
## 
## Variance Accounted For:
print(pca_fit$Vaccounted)
##                           RC6     RC3     RC4     RC2      RC1      RC7
## SS loadings           2.28112 2.26661 2.23042 2.14798 2.049052 2.041184
## Proportion Var        0.10862 0.10793 0.10621 0.10228 0.097574 0.097199
## Cumulative Var        0.10862 0.21656 0.32277 0.42505 0.522628 0.619827
## Proportion Explained  0.15175 0.15078 0.14837 0.14289 0.136308 0.135784
## Cumulative Proportion 0.15175 0.30253 0.45090 0.59379 0.730095 0.865880
##                            RC5
## SS loadings           2.016169
## Proportion Var        0.096008
## Cumulative Var        0.715835
## Proportion Explained  0.134120
## Cumulative Proportion 1.000000

The PCA results provide strong evidence that the Comprehensive Math Anxiety Assessment Scale (CMAAS) items cluster according to their intended theoretical domains.


1.8.4 Component Loadings

The rotated component solution shows a clean simple structure: each item loads strongly on its intended factor with no cross-loadings exceeding .30. This confirms that the seven subscales are well differentiated and internally coherent.

  • Cognitive (RC1):
    C1 = .79, C2 = .78, C3 = .75
    → All three items load uniquely on RC1, confirming the cognitive subscale.

  • Affective (RC3):
    A1 = .81, A2 = .80, A3 = .79
    → Very high, consistent loadings support a cohesive affective factor.

  • Physiological (RC5):
    P1 = .81, P2 = .80, P3 = .75
    → Clear clustering, showing good internal consistency.

  • Behavioral (RC6):
    B1 = .83, B2 = .82, B3 = .85
    → Very strong loadings, suggesting this is one of the most robust domains.

  • Value (RC4):
    V1 = .79, V2 = .86, V3 = .82
    → High loadings confirm the distinctiveness of the value domain.

  • Self-Efficacy (RC7):
    E1 = .80, E2 = .78, E3 = .76
    → All items load strongly, showing that self-efficacy is well-represented.

  • Social (RC2):
    S1 = .76, S2 = .83, S3 = .86
    → High loadings, especially S2 and S3, establish the social domain as clearly defined.


Explained Variance

  • Each component explains approximately 10 – 11% of the variance.
  • The cumulative variance explained by the seven components is 72%, which is strong for psychological scales.
  • Variance distribution is balanced across domains: none dominate excessively, reflecting the multidimensional nature of the scale.

Interpretation

  • The 7-component solution aligns exactly with the theoretical seven-domain structure of the CMAAS (Cognitive, Affective, Physiological, Behavioral, Value, Self-Efficacy, Social).
  • All items load strongly (> .75) on their respective factors, with no cross-loadings above .30, providing strong evidence of construct validity.
  • The balanced variance contribution (≈ 10% each) suggests that all domains make meaningful contributions to the overall construct of math anxiety.

Conclusion

The PCA confirms that the CMAAS exhibits a clear, interpretable 7-factor structure, with each theoretical domain represented by a strong, cohesive set of items. The scale demonstrates excellent factorial validity, with 72% of the total variance explained by the intended components, and the absence of cross-loadings underscores the distinctiveness of each domain.


1.8.5 Principal Axis Factoring – PAF

# FACTOR ANALYSIS (Principal Axis Factoring)
#===========================================================
cat("\n=== FACTOR ANALYSIS (PAF) ===\n")
## 
## === FACTOR ANALYSIS (PAF) ===
fa_fit <- fa(r = cormat,
             nfactors = extract,
             rotate = "varimax",
             fm = "pa",
             n.obs = nsize)

print(loadings(fa_fit), digits = 2, cutoff = minload)
## 
## Loadings:
##    PA4   PA7   PA3   PA2   PA5   PA6   PA1  
## C1                                      0.60
## C2                                      0.71
## C3                                      0.63
## A1        0.79                              
## A2        0.69                              
## A3        0.65                              
## P1                                0.65      
## P2                                0.77      
## P3                                0.61      
## B1  0.80                                    
## B2  0.65                                    
## B3  0.81                                    
## V1              0.69                        
## V2              0.82                        
## V3              0.70                        
## E1                          0.73            
## E2                          0.61            
## E3                          0.65            
## S1                    0.66                  
## S2                    0.71                  
## S3                    0.79                  
## 
##                 PA4  PA7  PA3  PA2  PA5  PA6  PA1
## SS loadings    1.96 1.86 1.86 1.75 1.58 1.58 1.56
## Proportion Var 0.09 0.09 0.09 0.08 0.08 0.08 0.07
## Cumulative Var 0.09 0.18 0.27 0.35 0.43 0.50 0.58
cat("\nFactor Characteristics:\n")
## 
## Factor Characteristics:
print(fa_fit$loadings)
## 
## Loadings:
##    PA4    PA7    PA3    PA2    PA5    PA6    PA1   
## C1  0.124  0.141  0.102         0.153  0.122  0.602
## C2  0.140  0.183  0.176         0.136  0.106  0.708
## C3  0.123  0.218                0.153  0.201  0.632
## A1  0.231  0.795  0.102                0.109  0.190
## A2  0.175  0.685         0.135         0.125  0.157
## A3  0.140  0.652         0.102         0.155  0.178
## P1                0.110                0.652       
## P2  0.102  0.138                       0.769  0.171
## P3         0.173                0.117  0.608  0.101
## B1  0.795  0.264  0.114                       0.107
## B2  0.653  0.141  0.154                            
## B3  0.814  0.127  0.111                0.122  0.176
## V1  0.102         0.687  0.192  0.194  0.104  0.117
## V2  0.156         0.820                       0.140
## V3  0.128         0.704         0.189              
## E1                0.110  0.169  0.726         0.184
## E2                0.162  0.125  0.608         0.133
## E3         0.109  0.142  0.196  0.648  0.136       
## S1         0.211  0.114  0.665  0.206              
## S2                       0.714  0.166              
## S3                       0.786                     
## 
##                  PA4   PA7   PA3   PA2   PA5   PA6   PA1
## SS loadings    1.957 1.859 1.855 1.745 1.580 1.577 1.557
## Proportion Var 0.093 0.089 0.088 0.083 0.075 0.075 0.074
## Cumulative Var 0.093 0.182 0.270 0.353 0.428 0.503 0.578
print(fa_fit$communalities)  # prints the total communality (sum over items, here ~12.13 of 21); use fa_fit$communality for the per-item vector
## [1] 12.131
# FACTOR SCORES
#===========================================================
cat("\n=== FACTOR SCORES ===\n")
## 
## === FACTOR SCORES ===
fa_scores <- fa(final_data,
                nfactors = extract,
                rotate = "varimax",
                fm = "pa",
                scores = "regression")

cat("\nFirst 5 cases:\n")
## 
## First 5 cases:
print(head(fa_scores$scores, 5))
##           PA4       PA7      PA3       PA2      PA5      PA6      PA1
## [1,] -0.79016 -0.064372 -1.46721 -0.699979  0.63229  0.15998 -1.32876
## [2,] -0.22390  0.127667  0.56799 -0.046944  0.62632 -0.27947  0.59157
## [3,] -0.96552 -0.878661 -0.31582 -0.510715  1.00097  0.34488  0.11432
## [4,] -0.32750 -0.963434  1.65117  0.385772 -0.54861 -0.68488 -0.71087
## [5,] -1.64759  0.761108  0.26192  0.036719  0.47697  0.32870 -0.65046
# SAVE RESULTS
#===========================================================
write.csv(fa_scores$scores, "factor_scores.csv", row.names = FALSE)

Principal Axis Factoring (PAF) was conducted to examine the latent structure of the Comprehensive Math Anxiety Assessment Scale (CMAAS). The results confirm a seven-factor solution, broadly consistent with the theoretical design of the instrument.


Factor Loadings by Domain

  • Cognitive (PA1)
    • C1 = .60, C2 = .71, C3 = .63
    • Moderate-to-strong loadings cluster cleanly on PA1, supporting the cognitive subscale.
  • Affective (PA7)
    • A1 = .79, A2 = .69, A3 = .65
    • Strong loadings on PA7 confirm the affective dimension.
  • Physiological (PA6)
    • P1 = .65, P2 = .77, P3 = .61
    • Solid loadings show clear alignment with the physiological factor.
  • Behavioral (PA4)
    • B1 = .80, B2 = .65, B3 = .81
    • Very strong loadings; this is one of the most robust domains.
  • Value (PA3)
    • V1 = .69, V2 = .82, V3 = .70
    • High loadings reflect a well-defined value dimension.
  • Self-Efficacy (PA5)
    • E1 = .73, E2 = .61, E3 = .65
    • All three items load appropriately, indicating a cohesive self-efficacy subscale.
  • Social (PA2)
    • S1 = .66, S2 = .71, S3 = .79
    • Strong and consistent loadings establish the social factor clearly.

Variance Explained

  • SS Loadings: Range from 1.56 to 1.96 per factor.
  • Proportion of Variance per factor: ≈ 7–9%.
  • Cumulative Variance: The seven factors together account for ≈ 58% of the total variance.

This is a strong result for psychological data, where 50 – 60% cumulative variance is typically considered very good (Hair et al., 2006; Kaiser, 1960; Peterson, 2000).


Interpretation

  • Each set of three items loads on its intended factor, confirming the theoretical 7-factor structure of the CMAAS.
  • All loadings are ≥ .60, which is considered very good in psychometric scale development.
  • Behavioral (.80 – .81), Value (.69 – .82), and Affective (.65 – .79) are particularly strong clusters.
  • Cognitive and Self-Efficacy show solid though slightly lower loadings (.60 – .73), still within the acceptable-to-strong range.
  • Cumulative variance (58%) demonstrates that the factor model captures a large proportion of the shared variance in item responses.

Factor Scores

  • Example output (first five respondents) shows individual-level scores across the seven latent factors.
  • These scores can be used for subscale profiling, group comparisons, or structural modeling.
  • Factor scores are standardized (mean ≈ 0, SD ≈ 1), so positive values indicate above-average levels of that factor and negative values below-average levels.

Conclusion

The PAF results confirm that the CMAAS exhibits a clear, interpretable 7-factor structure. Each domain is defined by strong, coherent item loadings, and the model explains a substantial portion of total variance. Together, these results provide robust evidence of the construct validity and multidimensional nature of the scale.


1.8.6 Confirmatory Factor Analysis

Confirmatory Factor Analysis (CFA) is a statistical method used to test whether a predefined factor structure (based on theory or prior research) fits observed data. Unlike Exploratory Factor Analysis (EFA), which explores patterns without assumptions, CFA explicitly tests a hypothesized model of how measured variables relate to latent constructs.

Model Specification

We compare three competing models: the 7-factor model representing the full theoretical structure suggested by the EFA; a constrained model that merges the affective and physiological factors into a single emotional-arousal domain (implemented by fixing their correlation to 1, equivalent to a 6-factor structure); and a further-constrained model that additionally merges value and self-efficacy (equivalent to a 5-factor structure), testing whether a more parsimonious solution explains the data as well.

# ===========================================================
# Simple CFA tables from scratch (base R + knitr::kable only)
# Works the same in HTML, PDF (LaTeX), and Word (no HTML/LaTeX styling tricks)
# ===========================================================

suppressPackageStartupMessages({
  library(knitr)
  library(lavaan)
})

# ---------- Ensure fits exist ----------
if (!exists("fit7", inherits = TRUE) || !inherits(fit7, "lavaan")) {
  ordered_items <- c(paste0("C",1:3), paste0("A",1:3), paste0("P",1:3),
                     paste0("B",1:3), paste0("V",1:3), paste0("E",1:3), paste0("S",1:3))
  ordered_items <- intersect(ordered_items, names(final_data))
  
  mod_7 <- '
    Cognitive      =~ C1 + C2 + C3
    Affective      =~ A1 + A2 + A3
    Physiological  =~ P1 + P2 + P3
    Behavioral     =~ B1 + B2 + B3
    Value          =~ V1 + V2 + V3
    Self_Efficacy  =~ E1 + E2 + E3
    Social         =~ S1 + S2 + S3
  '
  mod_7_constr_5 <- paste0(mod_7, "\nAffective ~~ 1*Physiological")
  mod_7_constr_4 <- paste0(mod_7_constr_5, "\nValue ~~ 1*Self_Efficacy")
  
  fit7    <- cfa(mod_7,          data = final_data, ordered = ordered_items,
                 estimator = "WLSMV", parameterization = "theta", std.lv = TRUE)
  fit7_c5 <- cfa(mod_7_constr_5, data = final_data, ordered = ordered_items,
                 estimator = "WLSMV", parameterization = "theta", std.lv = TRUE)
  fit7_c4 <- cfa(mod_7_constr_4, data = final_data, ordered = ordered_items,
                 estimator = "WLSMV", parameterization = "theta", std.lv = TRUE)
}

# ---------- Helpers ----------
fmt_p <- function(p) ifelse(is.na(p), "", ifelse(p < .001, "< .001", sprintf("%.3f", p)))
fmt_rmsea_ci <- function(r, lo, hi) sprintf("%.3f [%0.3f, %0.3f]", r, lo, hi)

# ---------- FIT TABLE (global indices) ----------
want <- c("chisq.scaled","df","pvalue.scaled","cfi.robust","tli.robust",
          "rmsea.robust","rmsea.ci.lower.robust","rmsea.ci.upper.robust","srmr")

row_measures <- function(fit, label){
  m <- fitMeasures(fit, want)
  data.frame(
    Model     = label,
    CFI       = sprintf("%.3f", unname(m["cfi.robust"])),
    TLI       = sprintf("%.3f", unname(m["tli.robust"])),
    RMSEA_CI  = fmt_rmsea_ci(unname(m["rmsea.robust"]),
                             unname(m["rmsea.ci.lower.robust"]),
                             unname(m["rmsea.ci.upper.robust"])),
    SRMR      = sprintf("%.3f", unname(m["srmr"])),
    ChiSq     = sprintf("%.2f", unname(m["chisq.scaled"])),
    df        = unname(m["df"]),
    p         = fmt_p(unname(m["pvalue.scaled"])),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

fit_tbl <- rbind(
  row_measures(fit7,    "7-Factor (unconstrained)"),
  row_measures(fit7_c5, "7-Factor + (Affective ~~ Physiological; corr = 1)"),
  row_measures(fit7_c4, "7-Factor + (Aff ~~ Phys; corr = 1) + (Value ~~ Self_Eff; corr = 1)")
)

kable(fit_tbl, caption = "Global Fit Indices (WLSMV, robust)")
Global Fit Indices (WLSMV, robust)

Model                                                                 CFI     TLI     RMSEA [CI]             SRMR    χ²       df    p
7-Factor (unconstrained)                                              1.000   1.000   0.001 [0.000, 0.032]   0.037   174.75   168   0.345
7-Factor + (Affective ~~ Physiological; corr = 1)                     0.885   0.857   0.080 [0.069, 0.091]   0.060   404.64   169   < .001
7-Factor + (Aff ~~ Phys; corr = 1) + (Value ~~ Self_Eff; corr = 1)    0.790   0.740   0.108 [0.098, 0.118]   0.074   584.46   170   < .001
# ---------- DIFFTEST TABLE (nested comparisons) ----------
cmp_5_vs_7 <- lavTestLRT(fit7_c5, fit7)
cmp_4_vs_7 <- lavTestLRT(fit7_c4, fit7)

grab_last <- function(obj){
  df <- as.data.frame(obj)
  df[nrow(df), , drop = FALSE]
}

last5 <- grab_last(cmp_5_vs_7)
last4 <- grab_last(cmp_4_vs_7)

diff_tbl <- rbind(
  data.frame(
    Comparison = "7-constrained(5) vs 7-unconstrained",
    `Δdf`      = last5[["Df diff"]],
    `Δχ²`      = sprintf("%.1f", last5[["Chisq diff"]]),
    p          = fmt_p(last5[["Pr(>Chisq)"]]),
    check.names = FALSE,
    stringsAsFactors = FALSE
  ),
  data.frame(
    Comparison = "7-constrained(4) vs 7-unconstrained",
    `Δdf`      = last4[["Df diff"]],
    `Δχ²`      = sprintf("%.1f", last4[["Chisq diff"]]),
    p          = fmt_p(last4[["Pr(>Chisq)"]]),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
)

kable(diff_tbl, caption = "Nested Model Comparisons (WLSMV DIFFTEST)")
Nested Model Comparisons (WLSMV DIFFTEST)

Comparison                            Δdf   Δχ²     p
7-constrained(5) vs 7-unconstrained   1     83.5    < .001
7-constrained(4) vs 7-unconstrained   2     163.2   < .001
# ---------- LOADINGS TABLE (Items only, no row numbers) ----------
std_sol <- standardizedSolution(fit7)
load_tbl <- std_sol[std_sol$op == "=~", c("rhs","est.std")]
colnames(load_tbl) <- c("Item","Loading")

# order however you like (here: by Item label)
ord <- order(load_tbl$Item)
load_tbl <- load_tbl[ord, , drop = FALSE]

# format numbers
load_tbl$Loading <- sprintf("%.3f", load_tbl$Loading)

# drop row names so kable doesn't print them
rownames(load_tbl) <- NULL

kable(load_tbl, align = c("l","r"),
      caption = "Standardized Loadings: 7-Factor CFA (WLSMV)")%>%
  kableExtra::kable_styling(full_width = FALSE)
Standardized Loadings: 7-Factor CFA (WLSMV)
Item Loading
A1 0.909
A2 0.773
A3 0.739
B1 0.917
B2 0.706
B3 0.870
C1 0.683
C2 0.817
C3 0.770
E1 0.736
E2 0.702
E3 0.794
P1 0.634
P2 0.870
P3 0.701
S1 0.823
S2 0.745
S3 0.774
V1 0.834
V2 0.828
V3 0.789
# ---------- LARGEST RESIDUAL CORRELATIONS ----------
res_mat <- lavResiduals(fit7, type = "cor")$cov
res_mat[lower.tri(res_mat, diag = TRUE)] <- NA
res_df <- as.data.frame(as.table(res_mat), stringsAsFactors = FALSE)
res_df <- res_df[!is.na(res_df$Freq), , drop = FALSE]
res_df$absr <- abs(res_df$Freq)
res_df <- res_df[order(-res_df$absr), , drop = FALSE]
top_n <- head(res_df, 12)

res_out <- data.frame(
  Item1 = top_n$Var1,
  Item2 = top_n$Var2,
  `Residual (cor)` = sprintf("%.3f", top_n$Freq),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

kable(res_out, caption = "Largest Residual Correlations (absolute)")
Largest Residual Correlations (absolute)
Item1 Item2 Residual (cor)
A2 S1 0.112
B1 S1 0.102
A3 P3 0.098
V1 S1 0.095
C3 S2 -0.089
C2 S1 0.084
V2 E1 -0.084
A1 P1 -0.081
C3 S3 -0.077
C3 V3 -0.074
B3 S3 -0.073
E3 S1 0.073

Interpretation Guidelines

Commonly cited cutoff values for model fit indices in CFA/SEM are:

  • CFI/TLI > 0.90 (acceptable), > 0.95 (excellent)
  • RMSEA < 0.08 (acceptable), < 0.05 (excellent)
  • SRMR < 0.08 (good)

These thresholds are widely referenced in the structural equation modeling literature (Bentler, 1990; Hu & Bentler, 1999; Steiger, 1990).
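
These cutoffs can also be checked programmatically. A hedged sketch against a fitted lavaan object: fit_check is a hypothetical helper, and the robust index names assume WLSMV fits like those above.

# Hypothetical helper: compare a lavaan fit to the cutoffs listed above.
fit_check <- function(fit) {
  m <- lavaan::fitMeasures(fit, c("cfi.robust", "tli.robust",
                                  "rmsea.robust", "srmr"))
  data.frame(
    Index = c("CFI", "TLI", "RMSEA", "SRMR"),
    Value = round(unname(m), 3),
    Meets = unname(c(m[1] > .90, m[2] > .90, m[3] < .08, m[4] < .08))
  )
}
# Example: fit_check(fit7)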


Confirmatory Factor Analysis Results

A confirmatory factor analysis (CFA) was conducted using the WLSMV estimator to evaluate the hypothesized seven-factor structure of the scale. The seven latent factors corresponded to the theoretical domains: Cognitive, Affective, Physiological, Behavioral, Value, Self-Efficacy, and Social. All items were treated as ordered categorical indicators.

Model Fit

The initial seven-factor unconstrained model demonstrated excellent fit to the data, with CFI and TLI equal to 1.000, RMSEA close to zero, SRMR well below .05, and a nonsignificant chi-square test. These indices surpass conventional cutoffs for good model fit (e.g., Hu & Bentler, 1999), supporting the adequacy of the proposed factor structure.

Two nested, more constrained models were tested to evaluate whether specific factor pairs could be collapsed. Constraining Affective and Physiological to correlate perfectly (ρ = 1) substantially worsened fit (CFI and TLI below .90, RMSEA ≈ .08, p < .001). Adding a further constraint equating Value and Self-Efficacy made the fit even poorer (CFI < .80, RMSEA > .10). Thus, collapsing these factors is not supported.

Nested Model Comparisons

WLSMV DIFFTESTs confirmed that both constrained models fit significantly worse than the unconstrained baseline. These findings indicate that the constrained models should be rejected in favor of the original seven-factor model, supporting the discriminant validity of the factors.

Standardized Factor Loadings

All items loaded significantly and substantively (λ ≥ .63), with most loadings exceeding .70. This provides strong evidence of convergent validity, indicating that the observed items reliably represent their intended latent constructs.

Residual Correlations

Inspection of the largest residual correlations revealed that all absolute values were modest (|r| ≤ .112). This suggests minimal local dependence among items after accounting for the latent factors, supporting the assumption of local independence.


✅ Summary Interpretation

  • The hypothesized seven-factor model shows excellent global fit.
  • Factor loadings are strong, supporting convergent validity.
  • Nested constraints degrade fit significantly, supporting discriminant validity.
  • Residual correlations are small, indicating no major misspecification.

Taken together, the CFA results provide compelling evidence that the seven latent factors represent distinct, reliable, and valid components of the construct.


1.9 Validity of Psychological Tests

1.9.1 Construct Validity

Definition: The extent to which a psychological test accurately measures the theoretical construct it is designed to assess, supported by empirical evidence and logical relationships with other variables.

Key Aspects:

  • Theoretical Alignment – The test should reflect the underlying construct’s definition and dimensions.
  • Empirical Support – Demonstrated through statistical analyses (e.g., factor analysis, MTMM).
  • Convergent & Discriminant Evidence – High correlations with related measures (convergent validity) and low correlations with unrelated ones (discriminant validity).
    • A test measuring "math anxiety" should:
      • Correlate highly with other math anxiety scales (convergent).
      • Show weak correlations with general anxiety scales (discriminant).

Key Methods:

  • Multitrait-Multimethod Matrix (MTMM)
    • Evaluates convergent and discriminant validity
    • Requires multiple measurement methods for multiple traits
  • Factor Analysis
    • Identifies underlying latent variables
      • Confirmatory (CFA) tests hypothesized structures
      • Exploratory (EFA) reveals emergent structures
  • Convergent-Discriminant Analysis
    • Convergent: High correlations with similar measures (\(r > 0.5\))
    • Discriminant: Low correlations with unrelated measures (\(r < 0.3\))

1.9.2 Content Validity

Definition: The extent to which items represent all facets of the construct.

Types:

Type               Description                                         Evaluation Method
Face Validity      Superficial appearance of measuring the construct   Stakeholder review
Logical Validity   Systematic coverage of all construct domains        Expert panel (Delphi method)

Evaluation Process:

  1. Define construct domains
  2. Map items to domains
  3. Calculate Content Validity Ratio (CVR):
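
The standard formula (Lawshe, 1975) is

\[
\mathrm{CVR} = \frac{n_e - N/2}{N/2},
\]

where \(n_e\) is the number of expert panelists rating an item as “essential” and \(N\) is the total number of panelists. CVR ranges from −1 to +1; values near +1 indicate strong agreement that the item belongs in the scale.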

1.9.3 Multitrait-Multimethod Matrix (MTMM)

The Multitrait-Multimethod Matrix (MTMM) is a systematic approach for evaluating construct validity through analysis of:

  • Convergent validity (evidence that different methods measure the same trait)
  • Discriminant validity (evidence that distinct traits remain differentiated)

In a Multitrait-Multimethod (MTMM) design, different methods refer to distinct measurement approaches (e.g., scales, raters, or instruments) used to assess the same traits. In a math anxiety scale, the traits (or constructs) would typically represent the theoretical components or dimensions of math anxiety, such as affective, cognitive, behavioral, physiological components.

Matrix Structure

Matrix Component      Description                                                            Significance
Main diagonal         Reliability coefficients (α/ω) or monotrait-monomethod correlations   Internal consistency
Adjacent triangles    Heterotrait-monomethod correlations                                    Method-specific variance
Off-diagonal blocks   Monotrait-heteromethod correlations                                    Primary validity evidence

Campbell & Fiske (1959) Validity Criteria

For adequate construct validity, the matrix must demonstrate:

  • Convergence
    • High correlations (> 0.5) between measures of:
      • Same trait + Different methods (validity diagonals)
  • Discrimination
    • Validity diagonals exceed:
      • Heterotrait-monomethod correlations (same method, different traits)
      • Heterotrait-heteromethod correlations (different methods, different traits)
    • Pattern similarity between:
      • Monomethod-heterotrait triangles
      • Heteromethod-heterotrait triangles

This script simulates and visualizes a Multitrait–Multimethod (MTMM) correlation matrix for demonstration purposes.

# ================================
# Publication-ready MTMM heatmap
# ================================

plot_mtmm_matrix <- function(
  x,
  title = "Multitrait–Multimethod Matrix",
  show_diag = FALSE,          # hide self-correlations by default
  digits = 2,                 # decimals for coefficients
  label_cex = 0.9,            # coefficient font size
  palette = colorRampPalette(c("#67001f", "#f7f7f7", "#053061"))(200),
  mar = c(0, 0, 5, 1)         # outer margins
) {
  if (!requireNamespace("corrplot", quietly = TRUE))
    stop("Please install 'corrplot'.")

  # ---- Coerce to correlation matrix ----
  C <- if (is.data.frame(x)) {
    num <- x[, vapply(x, is.numeric, logical(1L)), drop = FALSE]
    if (ncol(num) < 2) stop("Need at least two numeric columns.")
    stats::cor(num, use = "pairwise.complete.obs")
  } else {
    as.matrix(x)
  }
  if (!show_diag) diag(C) <- NA

  # ---- Helpers ----
  to_idx <- function(r, ncols) {
    idx <- floor(((r + 1) / 2) * (ncols - 1)) + 1
    pmax(1, pmin(ncols, idx))
  }
  luminance01 <- function(hex) {
    rgb <- grDevices::col2rgb(hex) / 255
    as.numeric(0.2126 * rgb[1, ] + 0.7152 * rgb[2, ] + 0.0722 * rgb[3, ])
  }

  # ---- Base heatmap ----
  old_par <- par(mar = mar)
  on.exit(par(old_par), add = TRUE)

  corrplot::corrplot(
    C,
    method = "color",
    type = "upper",
    tl.col = "black",
    tl.srt = 45,
    mar = c(0, 0, 0, 0),
    title = title,
    col = palette,
    addCoef.col = NULL,   # we draw labels ourselves
    cl.ratio = 0.2,
    cl.align.text = "l",
    diag = show_diag
  )

  # ---- Overlay coefficients (upper triangle only) ----
  par(xpd = NA)  # avoid clipping at edges
  p <- ncol(C)
  ncols <- length(palette)

  for (i in 1:p) for (j in 1:p) {
    if (i < j && !is.na(C[i, j])) {
      r <- C[i, j]
      tile_col <- palette[to_idx(r, ncols)]
      dark_bg  <- luminance01(tile_col) < 0.5
      fg <- if (dark_bg) "white" else "black"

      # coordinates centered in tile
      x0 <- j
      y0 <- p - i + 1
      lab <- sprintf(paste0("%.", digits, "f"), r)

      text(x0, y0, labels = lab, col = fg, cex = label_cex, font = 2)
    }
  }
}
library(corrplot)

set.seed(123)
n <- 200
trait1 <- rnorm(n); trait2 <- rnorm(n); trait3 <- rnorm(n)
method1 <- rnorm(n, 0, 0.5); method2 <- rnorm(n, 0, 0.5); method3 <- rnorm(n, 0, 0.5)
dat <- data.frame(
  T1M1 = trait1 + method1 + rnorm(n, 0, 0.3),
  T1M2 = trait1 + method2 + rnorm(n, 0, 0.3),
  T1M3 = trait1 + method3 + rnorm(n, 0, 0.3),
  T2M1 = trait2 + method1 + rnorm(n, 0, 0.3),
  T2M2 = trait2 + method2 + rnorm(n, 0, 0.3),
  T2M3 = trait2 + method3 + rnorm(n, 0, 0.3),
  T3M1 = trait3 + method1 + rnorm(n, 0, 0.3),
  T3M2 = trait3 + method2 + rnorm(n, 0, 0.3),
  T3M3 = trait3 + method3 + rnorm(n, 0, 0.3)
)

plot_mtmm_matrix(dat, title = "Multitrait–Multimethod Matrix", show_diag = FALSE)


Interpreting Results:

  • Convergent validity: High correlations between different methods measuring the same trait (e.g., T1M1, T1M2, T1M3 should correlate strongly). These are called monotrait-heteromethod correlations and indicate convergent validity (different methods measuring the same trait agree).
  • Discriminant validity: Lower correlations between different traits measured by the same method (e.g., T1M1, T2M1, T3M1). These are called heterotrait-monomethod correlations and reflect method bias (shared variance due to the same measurement method). The lowest correlations should be between different traits measured with different methods (e.g., T1M1, T2M2). These are called heterotrait-heteromethod correlations and indicate discriminant validity (traits are distinct).
  • Method effects: High correlations between different traits measured by the same method suggest strong method effects. A quick numeric check on the simulated data follows below.
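
To make the three correlation types concrete, here is a hedged sketch that pulls illustrative cells from the simulated dat created above; the specific cell choices are arbitrary examples.

# Illustrative cells for the three MTMM correlation types, taken
# from the simulated data (dat) defined above.
C <- cor(dat)

mono_hetero   <- c(C["T1M1", "T1M2"], C["T2M1", "T2M3"])  # same trait, different methods
hetero_mono   <- c(C["T1M1", "T2M1"], C["T1M2", "T3M2"])  # different traits, same method
hetero_hetero <- c(C["T1M1", "T2M2"], C["T2M1", "T3M3"])  # different traits, different methods

round(c(convergent   = mean(mono_hetero),
        method_bias  = mean(hetero_mono),
        discriminant = mean(hetero_hetero)), 2)
# Expected ordering: convergent (highest) > method_bias > discriminant (lowest)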

1.9.4 Discriminant Validity

# ===========================================================
# Discriminant Validity: HTMT, Fornell–Larcker, Cross-Loadings
# ===========================================================

# ---------------------------
# 0) PACKAGES
# ---------------------------
required_packages <- c("lavaan", "semTools", "dplyr", "tidyr")

install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

# Install any missing packages, then attach them quietly
invisible(lapply(required_packages, install_if_missing))

invisible(
  suppressPackageStartupMessages(
    lapply(required_packages, require, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)
  )
)

# ---------------------------
# 1) DATA
# ---------------------------
# Expect a CSV with columns: C1–C3, A1–A3, P1–P3, B1–B3, V1–V3, E1–E3, S1–S3
final_data <- read.csv("final_data.csv", header = TRUE, stringsAsFactors = FALSE)

# ---------------------------
# 2) CFA MODEL & FIT
# ---------------------------
cfa_model <- '
  Cognitive       =~ C1 + C2 + C3
  Affective       =~ A1 + A2 + A3
  Physiological   =~ P1 + P2 + P3
  Behavioral      =~ B1 + B2 + B3
  Value           =~ V1 + V2 + V3
  Self_Efficacy   =~ E1 + E2 + E3
  Social          =~ S1 + S2 + S3
'

# FIML is recommended if missingness exists; remove "missing" if you prefer listwise
fit <- lavaan::cfa(cfa_model, data = final_data, std.lv = TRUE, missing = "fiml")

# ---------------------------
# 3) METHOD 1: HTMT (robust to semTools version)
#    Use the (model + data) interface to avoid S4 coercion errors.
# ---------------------------
htmt_matrix <- semTools::htmt(model = cfa_model, data = final_data)
htmt_vals   <- htmt_matrix[lower.tri(htmt_matrix)]
max_htmt    <- max(htmt_vals, na.rm = TRUE)
htmt_thresh <- 0.85
htmt_violations <- sum(htmt_vals > htmt_thresh, na.rm = TRUE)

# Optional: list violating pairs
get_lower_pairs <- function(mat) {
  dn <- dimnames(mat)
  pairs <- which(lower.tri(mat), arr.ind = TRUE)
  data.frame(
    Factor_i = dn[[1]][pairs[, 1]],
    Factor_j = dn[[2]][pairs[, 2]],
    HTMT     = mat[lower.tri(mat)],
    stringsAsFactors = FALSE
  )
}
htmt_pairs_df <- get_lower_pairs(htmt_matrix) |>
  dplyr::mutate(Violation = HTMT > htmt_thresh)

# ---------------------------
# 4) METHOD 2: Fornell–Larcker
#    √AVE for each factor should exceed |latent correlation| for every other factor.
#    Count each factor pair ONCE.
# ---------------------------
ave_values <- semTools::AVE(fit)                      # named numeric vector
lv_cor     <- lavaan::lavInspect(fit, "cor.lv")       # latent correlation matrix

comb_pairs <- utils::combn(names(ave_values), 2, simplify = FALSE)

fl_pairs_df <- dplyr::bind_rows(lapply(comb_pairs, function(p) {
  i <- p[1]; j <- p[2]
  sqrtAVE_i <- sqrt(ave_values[i])
  sqrtAVE_j <- sqrt(ave_values[j])
  phi_ij    <- abs(lv_cor[i, j])
  violation <- (sqrtAVE_i <= phi_ij) || (sqrtAVE_j <= phi_ij)
  data.frame(Factor_i = i, Factor_j = j,
             sqrtAVE_i = sqrtAVE_i, sqrtAVE_j = sqrtAVE_j,
             abs_phi   = phi_ij, Violation = violation,
             stringsAsFactors = FALSE)
}))
fl_violations <- sum(fl_pairs_df$Violation, na.rm = TRUE)

# ---------------------------
# 5) METHOD 3: Cross-Loadings
#    Use the standardized loading matrix (zeros for fixed paths).
#    Flag items where (primary − secondary) < 0.20
# ---------------------------
Std    <- lavaan::lavInspect(fit, "std")
Lambda <- Std$lambda                              # items × factors (std)

# Safety: remove rows that are completely NA (shouldn't happen in well-posed models)
Lambda <- Lambda[rowSums(is.na(Lambda)) < ncol(Lambda), , drop = FALSE]

absL <- abs(Lambda)
primary_idx   <- apply(absL, 1, which.max)
primary_val   <- apply(absL, 1, max, na.rm = TRUE)
secondary_val <- apply(absL, 1, function(x) { x[which.max(x)] <- -Inf; max(x, na.rm = TRUE) })

cross_df <- data.frame(
  Item            = rownames(Lambda),
  PrimaryFactor   = colnames(Lambda)[primary_idx],
  PrimaryLoading  = primary_val,
  SecondaryLoading= secondary_val,
  Difference      = primary_val - secondary_val,
  stringsAsFactors = FALSE
)

diff_thresh <- 0.20
cross_df$Violation <- cross_df$Difference < diff_thresh
mean_diff          <- mean(cross_df$Difference, na.rm = TRUE)
n_violations_cross <- sum(cross_df$Violation, na.rm = TRUE)

# ---------------------------
# 6) FINAL REPORT
# ---------------------------
cat("DISCRIMINANT VALIDITY ASSESSMENT\n")
## DISCRIMINANT VALIDITY ASSESSMENT
cat("================================\n\n")
## ================================
cat("1) HTMT\n")
## 1) HTMT
cat("   - Max HTMT:", round(max_htmt, 3), "\n")
##    - Max HTMT: 0.53
cat("   - Threshold:", htmt_thresh, "\n")
##    - Threshold: 0.85
cat("   - Violations (# pairs with HTMT >", htmt_thresh, "): ", htmt_violations, "\n\n", sep = "")
##    - Violations (# pairs with HTMT >0.85): 0
cat("2) Fornell–Larcker\n")
## 2) Fornell–Larcker
cat("   - Violations (# pairs where |phi| ≥ √AVE_i or √AVE_j): ", fl_violations, "\n\n", sep = "")
##    - Violations (# pairs where |phi| ≥ √AVE_i or √AVE_j): 0
cat("3) Cross-Loadings\n")
## 3) Cross-Loadings
cat("   - Mean (primary − secondary): ", round(mean_diff, 3), "\n", sep = "")
##    - Mean (primary − secondary): 0.749
cat("   - Items with difference <", diff_thresh, ": ", n_violations_cross, "\n\n", sep = "")
##    - Items with difference <0.2: 0
# Optional: print detailed rows if violations exist
if (htmt_violations > 0) {
  cat("HTMT Violating Pairs:\n")
  print(dplyr::filter(htmt_pairs_df, Violation) |>
          dplyr::arrange(dplyr::desc(HTMT)))
  cat("\n")
}
if (fl_violations > 0) {
  cat("Fornell–Larcker Violations:\n")
  print(dplyr::filter(fl_pairs_df, Violation) |>
          dplyr::arrange(dplyr::desc(abs_phi)))
  cat("\n")
}
if (n_violations_cross > 0) {
  cat("Items with Cross-Loading Gap <", diff_thresh, ":\n", sep = "")
  print(dplyr::filter(cross_df, Violation) |>
          dplyr::arrange(Difference))
  cat("\n")
}

Discriminant validity was evaluated using the HTMT ratio, Fornell–Larcker criterion, and cross-loadings. The HTMT values were well below the recommended threshold, with the maximum HTMT observed at 0.53, far lower than the conservative cutoff of 0.85 (Henseler et al., 2015), indicating no problematic overlap between constructs. The Fornell–Larcker analysis revealed no violations; in all cases, the square root of each construct’s AVE exceeded its correlations with other constructs, confirming that constructs shared more variance with their own items than with others. Examination of cross-loadings further supported discriminant validity, with a mean difference of 0.749 between primary and secondary loadings and no items exhibiting a difference smaller than 0.20. Taken together, these results provide strong evidence that the latent constructs are empirically distinct and well-differentiated from one another.


1.9.5 Convergent Validity

standardized_loadings <- standardizedSolution(fit) %>% 
  dplyr::filter(op == "=~") %>%
  dplyr::select(Item = rhs, Loading = est.std)

# Display with kable (make sure knitr is loaded)
knitr::kable(standardized_loadings, digits = 3) %>%
  kableExtra::kable_styling(full_width = FALSE)
Item Loading
C1 0.660
C2 0.766
C3 0.744
A1 0.864
A2 0.750
A3 0.711
P1 0.626
P2 0.842
P3 0.642
B1 0.870
B2 0.687
B3 0.838
V1 0.775
V2 0.815
V3 0.758
E1 0.744
E2 0.670
E3 0.727
S1 0.739
S2 0.731
S3 0.770

Interpretation Guidelines:

Loading Range   Interpretation   Recommendation
≥ 0.70          Excellent        Keep item
0.50–0.69       Acceptable       Consider revision
< 0.50          Problematic      Remove/revise

Convergent validity was supported by the magnitude of standardized factor loadings. Most items loaded strongly on their intended constructs, with the majority exceeding the recommended cutoff of .70 (Fornell & Larcker, 1981; Hair et al., 2019). For example, affective items loaded between .71 and .86, behavioral items ranged from .69 to .87, and value items loaded between .76 and .82. Although a few indicators were slightly below .70 (e.g., C1 = .66, P1 = .63, P3 = .64, E2 = .67), all exceeded the minimum acceptable threshold of .60, suggesting that they still contribute meaningfully to their respective constructs. Taken together, these results provide evidence of convergent validity, indicating that the items converge on their hypothesized latent constructs.


Composite Reliability (CR)

# Using updated semTools function
semTools::compRelSEM(fit)
##     Cognitive     Affective Physiological    Behavioral         Value 
##         0.769         0.822         0.754         0.841         0.827 
## Self_Efficacy        Social 
##         0.757         0.791

Convergent validity was supported by both standardized factor loadings and composite reliability coefficients. All items loaded significantly on their intended constructs, with most standardized loadings exceeding the recommended cutoff of .70 and the remainder above .60. Composite reliability (CR) estimates further confirmed internal consistency, with values ranging from .75 to .84 across constructs (Cognitive = .77, Affective = .82, Physiological = .75, Behavioral = .84, Value = .83, Self-Efficacy = .76, Social = .79). All CR values exceeded the .70 benchmark (Nunnally & Bernstein, 1994; Hair et al., 2019), indicating that the constructs demonstrate adequate convergent validity.
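
For transparency, composite reliability can also be computed by hand from the standardized loadings. The following sketch (assuming the `fit` object above, and taking 1 − λ² as each standardized indicator's error variance) should closely reproduce semTools::compRelSEM():

# Hand-computed composite reliability (sketch): CR = (Σλ)² / ((Σλ)² + Σ(1 − λ²))
std   <- lavaan::standardizedSolution(fit)
loads <- std[std$op == "=~", c("lhs", "est.std")]

cr_by_factor <- sapply(split(loads$est.std, loads$lhs),
                       function(l) sum(l)^2 / (sum(l)^2 + sum(1 - l^2)))
round(cr_by_factor, 3)   # compare with semTools::compRelSEM(fit)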


Average Variance Extracted (AVE)

# Using current AVE function
semTools::AVE(fit)
##     Cognitive     Affective Physiological    Behavioral         Value 
##         0.528         0.608         0.515         0.642         0.614 
## Self_Efficacy        Social 
##         0.509         0.556

Convergent validity was further assessed using the Average Variance Extracted (AVE). All constructs demonstrated AVE values above the recommended threshold of .50 (Fornell & Larcker, 1981), ranging from .51 to .64 (Cognitive = .53, Affective = .61, Physiological = .52, Behavioral = .64, Value = .61, Self-Efficacy = .51, Social = .56). These results indicate that each latent construct explains more than half of the variance in its associated indicators, thereby providing additional evidence of convergent validity.
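
AVE has an equally simple hand computation: the mean squared standardized loading within each factor. A sketch, again assuming the `fit` object above:

# Hand-computed AVE (sketch): mean of squared standardized loadings per factor
std   <- lavaan::standardizedSolution(fit)
loads <- std[std$op == "=~", c("lhs", "est.std")]

ave_by_factor <- sapply(split(loads$est.std, loads$lhs), function(l) mean(l^2))
round(ave_by_factor, 3)   # compare with semTools::AVE(fit)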


Check Overall Model Fit

# ===========================================================
# Model Fit Statistics (Improved Formatting)
# ===========================================================

fit_measures <- fitMeasures(fit, c("cfi","tli","rmsea","srmr")) %>% 
  round(3) %>% 
  as.data.frame() %>%
  rename(Value = 1) %>%
  mutate(
    Threshold = case_when(
      rownames(.) == "cfi" ~ ">0.95",
      rownames(.) == "tli" ~ ">0.95",
      rownames(.) == "rmsea" ~ "<0.06",
      rownames(.) == "srmr" ~ "<0.08"
    ),
    Interpretation = case_when(
      (rownames(.) %in% c("cfi","tli") & Value >= 0.95) ~ "Excellent",
      (rownames(.) %in% c("cfi","tli") & Value >= 0.90) ~ "Acceptable",
      (rownames(.) == "rmsea" & Value <= 0.06) ~ "Excellent",
      (rownames(.) == "rmsea" & Value <= 0.08) ~ "Acceptable",
      (rownames(.) == "srmr" & Value <= 0.08) ~ "Excellent",
      TRUE ~ "Poor"
    )
  )

# Create the table without trailing bars
knitr::kable(fit_measures, 
             caption = "Model Fit Indices with Interpretation",
             align = c("l","r","c","l"),
             col.names = c("Fit Index", "Value", "Threshold", "Interpretation"),
             format = "html") %>%  # Explicitly specify HTML output
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "left"
  ) %>%
  kableExtra::row_spec(which(fit_measures$Interpretation == "Poor"), 
                       background = "#FFDDDD") %>%
  kableExtra::row_spec(which(fit_measures$Interpretation == "Acceptable"), 
                       background = "#FFF3CD") %>%
  kableExtra::column_spec(4, width = "3cm")  # Adjust interpretation column width
Model Fit Indices with Interpretation
Fit Index Value Threshold Interpretation
cfi 1.000 >0.95 Excellent
tli 1.002 >0.95 Excellent
rmsea 0.000 <0.06 Excellent
srmr 0.034 <0.08 Excellent

Acceptable Thresholds:

  • CFI ≥ 0.90 (≥ 0.95 ideal)
  • TLI ≥ 0.90 (≥ 0.95 ideal)
  • RMSEA ≤ 0.08 (≤ 0.06 ideal)
  • SRMR ≤ 0.08 (≤ 0.06 ideal)
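
Beyond the four indices tabulated above, it is common to report the chi-square test and the RMSEA confidence interval as well. A one-line sketch, assuming the `fit` object from the CFA above:

# Additional fit measures worth reporting (sketch)
lavaan::fitMeasures(fit, c("chisq", "df", "pvalue",
                           "rmsea.ci.lower", "rmsea.ci.upper"))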

Overall Assessment Framework

Evidence Level      Criteria                      Indicators
Strong Evidence     All criteria met              Most loadings ≥ 0.70; all CR ≥ 0.80; all AVE ≥ 0.50
Moderate Evidence   Some criteria partially met   Some loadings 0.50–0.69; some CR 0.70–0.79; some AVE 0.40–0.49
Weak Evidence       Multiple criteria not met     Multiple loadings < 0.50; CR values < 0.70; AVE values < 0.40

The measurement model demonstrated:

  • Strong convergent validity (most loadings ≥ 0.70 and none below 0.60; all CR ≥ 0.75; all AVE ≥ 0.50)
  • Excellent discriminant validity (all HTMT < 0.85; Fornell–Larcker criterion met)
  • Excellent model fit (all fit indices within recommended thresholds)

2 Classical Test Theory (CTT)

CTT, sometimes referred to as weak theory or true-score theory, is one of the foundational frameworks in psychometrics, designed to analyze the properties of psychological and educational tests with an emphasis on understanding item functioning, test-taker ability, and measurement error (Crocker & Algina, 1986; Lord & Novick, 1968). At the heart of CTT is the assumption that every examinee has a true score, which represents their expected performance if the same test were administered an infinite number of times under identical conditions with no error (Allen & Yen, 2002a; Novick, 1966). In practice, what is observed is not this true score but an observed score, which reflects both the true score and a random error component.

A key contribution of CTT lies in its ability to model the influence of error on test reliability, enabling the estimation of how consistently a test measures the construct of interest across administrations (Haertel, 2006). Reliability coefficients such as Cronbach’s α, split-half reliability, and test–retest reliability are rooted in CTT, providing practical tools for evaluating the dependability of test scores. CTT also informs indices of item difficulty and item discrimination, which are used to refine instruments and ensure that test items adequately represent the targeted ability range of examinees.

Despite its strengths and longstanding use, CTT rests on simplifying assumptions that limit its generalizability. For example, reliability is sample-dependent, meaning estimates can vary substantially across different populations. In addition, CTT treats measurement error as undifferentiated, without distinguishing between sources of error (e.g., systematic bias versus random fluctuation), and item statistics are tied directly to the test form and group on which they were estimated. These limitations constrain the comparability of test results across different forms or populations.

Modern approaches such as Item Response Theory (IRT) extend and refine CTT by modeling the probability of a correct response as a function of an examinee’s latent trait level and item parameters (e.g., difficulty, discrimination, and guessing). Unlike CTT, IRT provides item-level precision, allows for test equating across populations, and supports adaptive testing by tailoring item selection to individual ability estimates (Traub, 1997). Nevertheless, CTT remains widely used due to its conceptual simplicity, relatively modest sample size requirements, and ease of computation, continuing to serve as a cornerstone of test development and validation in both educational and psychological measurement contexts.


The foundational equation of CTT defines the observed score:

\[ X = \tau + \varepsilon \]

where:

\(\quad X\) = observed test score

\(\quad \tau\) = true score (latent ability)

\(\quad \varepsilon\) = measurement error


2.1 Core Assumptions of CTT

  1. Expected Value Condition

    \[E[X] = \tau\]

    • The true score (\(\tau\)) equals the expected value of observed scores (\(X\)) over infinite administrations (Allen & Yen, 2002b, p. 45).
    • Implies measurement errors average to zero: \(E[\varepsilon] = 0\)
  2. Error-True Score Independence

    \[\rho(\varepsilon, \tau) = 0\]

    • Measurement errors (\(\varepsilon\)) are uncorrelated with true scores (Feldt & Brennan, 1989, Ch. 4)
    • Critical for unbiased estimation of reliability
  3. Cross-Test Error Independence

    \[\rho(\varepsilon_1, \varepsilon_2) = 0\]

    • Errors between different tests are uncorrelated (Haertel, 2006)
    • Violations occur with fatigue or practice effects
  4. Error-Other True Score Independence

    \[\rho(\varepsilon_1, \tau_2) = 0\]

    • Errors on Test 1 are uncorrelated with true scores on Test 2
    • Ensures discriminant validity between constructs
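
These assumptions are easy to see in simulation. The following minimal sketch (all values hypothetical) generates true scores and independent errors, then checks the first two assumptions empirically:

# Simulate X = tau + epsilon and check the CTT assumptions (sketch)
set.seed(42)
n   <- 10000
tau <- rnorm(n, mean = 50, sd = 10)   # true scores
eps <- rnorm(n, mean = 0,  sd = 4)    # random error with E[eps] = 0
X   <- tau + eps                      # observed scores

mean(eps)                         # ~0: errors average out (Assumption 1)
cor(eps, tau)                     # ~0: errors uncorrelated with true scores (Assumption 2)
c(var(X), var(tau) + var(eps))    # observed variance = true variance + error variance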

2.2 Special Test Relationships

  1. Parallel Tests

    Two tests \(X\) and \(X'\) are parallel if they satisfy:

    1. All core assumptions (1-4)
    2. \(\tau = \tau'\) (equal true scores)
    3. \(\sigma^2_\varepsilon = \sigma^2_{\varepsilon'}\) (equal error variances)

    Implications (Traub, 1997):

    • Measure identical constructs with equal precision
    • Produce identical observed score distributions
    • Required for test-retest reliability estimation
  2. Essentially τ-Equivalent Tests

    Two tests \(X_1\) and \(X_2\) are essentially τ-equivalent if:

    1. All core assumptions (1-4)
    2. \(\tau_1 = \tau_2 + c\) (true scores differ by a constant)

    Key Properties (Crocker & Algina, 1986, p. 112):

    • Measure the same construct on linearly related scales
    • Allow different error variances (\(\sigma^2_{\varepsilon_1} \neq \sigma^2_{\varepsilon_2}\))
    • Form the basis for Cronbach's alpha reliability
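
As a brief illustration (simulated values, not the test data used elsewhere in this document), two parallel forms built from the same true scores with equal error variances correlate at the test's theoretical reliability:

# Parallel forms: the between-forms correlation estimates reliability (sketch)
set.seed(1)
tau <- rnorm(5000)                 # true scores, variance ~1
X1  <- tau + rnorm(5000, 0, 0.6)   # form 1
X2  <- tau + rnorm(5000, 0, 0.6)   # form 2: same tau, same error SD

cor(X1, X2)       # empirical parallel-forms correlation
1 / (1 + 0.6^2)   # theoretical reliability var(tau)/var(X) ≈ 0.735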

2.3 Dichotomous Test Data Simulation

Dichotomous test data analysis involves assessments where items have only two possible responses (e.g., correct/incorrect, true/false, yes/no). This is common in standardized testing, certification exams, and surveys. Below is a structured approach to analyzing dichotomous test data:


Simulation of Multiple-Choice Test Response Data

This script defines a function called simulate_test_data() that generates artificial polytomous (multiple-choice) test data using the Graded Response Model (GRM); the responses are dichotomized against an answer key in a later step.

#' Simulate polytomous test data using the Graded Response Model (GRM)
#'
#' Generates ordinal responses for a multiple-category test under a GRM,
#' returning both numeric categories (1..K) and letter options (A,B,...).
#'
#' @param n_items Integer. Number of items (default = 30)
#' @param n_examinees Integer. Number of examinees (default = 1000)
#' @param n_categories Integer. Number of response categories per item (K >= 2; default = 5)
#' @param set_seed Optional integer random seed for reproducibility (default = 123)
#'
#' @return A list with:
#' \itemize{
#'   \item \code{response_matrix} \{matrix\}: Ordinal responses (1..K), dimnames set
#'   \item \code{response_letters} \{data.frame\}: Same responses as letters (A..)
#'   \item \code{true_abilities} \{numeric\}: Vector of latent thetas
#'   \item \code{item_discriminations} \{numeric\}: a-parameters (length = n_items)
#'   \item \code{item_thresholds} \{list\}: length (K-1) threshold vectors per item
#'   \item \code{answer_key} \{named character\}: Correct option per item (highest category)
#' }
#'
#' @details
#' For item j with discrimination \(a_j\) and ordered thresholds \(\mathbf{b}_j\),
#' category probabilities are computed from cumulative logits and then differenced.
#' Small numerical negatives are truncated at 0 and re-normalized to sum to 1.
#'
#' @export
simulate_test_data <- function(n_items = 30,
                               n_examinees = 1000,
                               n_categories = 5,
                               set_seed = 123) {
  # ---- validation ----
  stopifnot(is.numeric(n_items), n_items >= 1, n_items == as.integer(n_items))
  stopifnot(is.numeric(n_examinees), n_examinees >= 1, n_examinees == as.integer(n_examinees))
  stopifnot(is.numeric(n_categories), n_categories >= 2, n_categories == as.integer(n_categories))

  if (!is.null(set_seed)) set.seed(as.integer(set_seed))

  # ---- latent abilities ----
  theta <- rnorm(n_examinees)

  # ---- item params ----
  a <- rlnorm(n_items, meanlog = 0, sdlog = 0.4) + 0.5
  b_list <- lapply(seq_len(n_items), function(i) sort(rnorm(n_categories - 1, 0, 1)))

  # ---- containers ----
  item_names <- paste0("Item", seq_len(n_items))
  option_letters <- LETTERS[seq_len(n_categories)]

  responses <- matrix(NA_integer_, nrow = n_examinees, ncol = n_items,
                      dimnames = list(NULL, item_names))
  responses_letter <- matrix(NA_character_, nrow = n_examinees, ncol = n_items,
                             dimnames = list(NULL, item_names))

  # ---- simulate responses ----
  for (j in seq_len(n_items)) {
    thresholds <- b_list[[j]]
    aj <- a[j]

    for (i in seq_len(n_examinees)) {
      # cumulative probs P(Y > k), k = 1..K-1 (decreasing in k since thresholds are sorted)
      P_cum <- plogis(aj * (theta[i] - thresholds))
      # convert to category probs
      P_full <- numeric(n_categories)
      P_full[1] <- 1 - P_cum[1]
      if (n_categories > 2) {
        for (k in 2:(n_categories - 1)) {
          P_full[k] <- P_cum[k - 1] - P_cum[k]
        }
      }
      P_full[n_categories] <- P_cum[n_categories - 1]

      # numerical guard: truncate tiny negatives and renormalize
      P_full <- pmax(P_full, 0)
      s <- sum(P_full)
      if (!is.finite(s) || s <= 0) {
        # extremely rare; fall back to uniform
        P_full <- rep(1 / n_categories, n_categories)
      } else {
        P_full <- P_full / s
      }

      category <- sample.int(n_categories, size = 1, prob = P_full)
      responses[i, j] <- category
      responses_letter[i, j] <- option_letters[category]
    }
  }

  # ---- answer key: highest category is keyed correct ----
  answer_key <- setNames(rep(option_letters[n_categories], n_items), item_names)

  # return letters as data.frame for convenient write.csv/use, no factors
  response_letters_df <- as.data.frame(responses_letter, stringsAsFactors = FALSE)

  list(
    response_matrix       = responses,
    response_letters      = response_letters_df,
    true_abilities        = as.numeric(theta),
    item_discriminations  = as.numeric(a),
    item_thresholds       = b_list,
    answer_key            = answer_key
  )
}

# Save to a file
dump("simulate_test_data", file = "simulate_test_data.R")
# Example usage
set.seed(123)
test_data <- simulate_test_data()
# boxplot(test_data$response_matrix)

# Save response matrix with letter options
write.csv(test_data$response_letters, "grm_letter_responses.csv", row.names = FALSE)

# Save numeric response matrix (optional)
write.csv(test_data$response_matrix, "grm_numeric_responses.csv", row.names = FALSE)

# Save correct answer key
answer_key_df <- data.frame(
  Item = names(test_data$answer_key),
  Correct_Option = test_data$answer_key
)
write.csv(answer_key_df, "grm_answer_key.csv", row.names = FALSE)
# Load letter responses
responses_letter <- read.csv("grm_letter_responses.csv")

# Load numeric responses
responses_numeric <- read.csv("grm_numeric_responses.csv")

# Load answer key
answer_key <- read.csv("grm_answer_key.csv")

# Verify structure
cat("\nResponse Data Structure:\n")
## 
## Response Data Structure:
str(responses_letter, vec.len = 2, give.attr = FALSE)
## 'data.frame':    1000 obs. of  30 variables:
##  $ Item1 : chr  "E" "C" ...
##  $ Item2 : chr  "B" "B" ...
##  $ Item3 : chr  "A" "E" ...
##  $ Item4 : chr  "E" "D" ...
##  $ Item5 : chr  "D" "A" ...
##  $ Item6 : chr  "C" "B" ...
##  $ Item7 : chr  "A" "E" ...
##  $ Item8 : chr  "B" "B" ...
##  $ Item9 : chr  "B" "A" ...
##  $ Item10: chr  "A" "C" ...
##  $ Item11: chr  "B" "D" ...
##  $ Item12: chr  "A" "A" ...
##  $ Item13: chr  "A" "B" ...
##  $ Item14: chr  "C" "C" ...
##  $ Item15: chr  "D" "D" ...
##  $ Item16: chr  "B" "C" ...
##  $ Item17: chr  "E" "B" ...
##  $ Item18: chr  "A" "A" ...
##  $ Item19: chr  "A" "C" ...
##  $ Item20: chr  "A" "B" ...
##  $ Item21: chr  "A" "C" ...
##  $ Item22: chr  "A" "A" ...
##  $ Item23: chr  "D" "D" ...
##  $ Item24: chr  "B" "A" ...
##  $ Item25: chr  "A" "E" ...
##  $ Item26: chr  "A" "A" ...
##  $ Item27: chr  "D" "D" ...
##  $ Item28: chr  "A" "C" ...
##  $ Item29: chr  "A" "A" ...
##  $ Item30: chr  "B" "E" ...
cat("\nKey Structure (now proper named list):\n")
## 
## Key Structure (now proper named list):
str(answer_key, vec.len = 5, give.attr = FALSE)
## 'data.frame':    30 obs. of  2 variables:
##  $ Item          : chr  "Item1" "Item2" "Item3" "Item4" "Item5" ...
##  $ Correct_Option: chr  "E" "E" "E" "E" "E" ...

Before converting to binary (0/1) format, the letter responses must first be recoded numerically.

This code takes categorical item responses (letters A, B, C, D, E) and converts them into numeric category values (1–5).

library(dplyr)   # or library(tidyverse)

# Convert to proper format if needed
if(!is.matrix(responses_letter) && !is.data.frame(responses_letter)) {
  responses_letter <- as.matrix(responses_letter)
}

# Define your scoring scheme (adjust as needed)
scoring_key <- c("A" = 1, "B" = 2, "C" = 3, "D" = 4, "E" = 5)

# Convert all responses to numeric
responses_numeric <- responses_letter %>%
  mutate(across(everything(), ~scoring_key[as.character(.)]))

# Ensure column names exist
if(is.null(colnames(responses_numeric))) {
  colnames(responses_numeric) <- paste0("Item_", seq_len(ncol(responses)))
}
library(kableExtra)

# Compute n per item (non-missing counts)
n_counts <- colSums(!is.na(responses_numeric))

# Build footnote text
n_note <- if (length(unique(n_counts)) == 1) {
  paste0("Note: n = ", unique(n_counts))
} else {
  paste0("n varies by item: median = ", stats::median(n_counts),
         " (min = ", min(n_counts), ", max = ", max(n_counts), ")")
}

# Create statistics data frame
response_stats <- tryCatch({
  data.frame(
    Item = colnames(responses_numeric),
    Mean = colMeans(responses_numeric, na.rm = TRUE),
    SD   = apply(responses_numeric, 2, sd, na.rm = TRUE),
    row.names = NULL
  )
}, error = function(e) {
  message("Error creating stats table: ", e$message)
  NULL
})

# Render table with footnote
if (!is.null(response_stats)) {
  response_stats |>
    kableExtra::kable("html", caption = "Response Summary", digits = 2) |>
    kableExtra::kable_styling(full_width = FALSE) |>
    kableExtra::column_spec(1:3, width = "1.5cm") |>
    kableExtra::footnote(general = n_note)
} else {
  message("Could not create response statistics table")
}
Response Summary
Item Mean SD
Item1 3.50 1.64
Item2 2.87 1.62
Item3 2.69 1.37
Item4 3.22 1.66
Item5 3.06 1.52
Item6 3.13 1.61
Item7 2.70 1.51
Item8 2.38 1.34
Item9 2.74 1.40
Item10 2.83 1.41
Item11 3.44 1.48
Item12 2.98 1.67
Item13 2.76 1.54
Item14 3.42 1.34
Item15 3.48 1.36
Item16 2.65 1.14
Item17 2.67 1.62
Item18 2.50 1.73
Item19 3.68 1.50
Item20 3.59 1.45
Item21 2.72 1.26
Item22 2.70 1.62
Item23 3.66 0.97
Item24 2.38 1.11
Item25 3.30 1.65
Item26 3.06 1.64
Item27 3.40 1.36
Item28 2.88 1.54
Item29 3.10 1.77
Item30 3.01 1.46
Note: n = 1000

This R function converts polytomous (multiple-choice) test responses into dichotomous (right/wrong) scoring based on a provided answer key.

dichotomize <- function(data, key, na_to_zero = FALSE) {
  # Convert to data frame if it's a matrix
  if (is.matrix(data)) data <- as.data.frame(data)
  
  # Initialize output
  dich_data <- data.frame(matrix(NA, nrow = nrow(data), ncol = ncol(data)))
  colnames(dich_data) <- colnames(data)
  
  # Go item by item
  for (item_name in names(key)) {
    if (!item_name %in% names(data)) {
      warning(sprintf("Item '%s' not found in data", item_name))
      next
    }
    
    correct_values <- key[[item_name]]
    response_values <- data[[item_name]]
    dich_data[[item_name]] <- as.integer(response_values %in% correct_values)
    
    # Optional NA handling
    if (na_to_zero) {
      dich_data[[item_name]][is.na(dich_data[[item_name]])] <- 0
    }
    
    # Check if all NA (e.g., no matches)
    if (all(is.na(dich_data[[item_name]]))) {
      warning(sprintf("No correct matches for item '%s'", item_name))
      cat("  Key values:", paste(correct_values, collapse = ", "), "\n")
      cat("  Response values:", paste(unique(response_values), collapse = ", "), "\n")
    }
  }
  
  return(dich_data)
}


# Save to a file
dump("dichotomize", file = "dichotomize.R")
# Fix column names in responses_letter to match key
colnames(responses_letter) <- paste0("Item", 1:ncol(responses_letter))

# Convert answer_key to named list
answer_key_list <- setNames(as.list(answer_key$Correct_Option), answer_key$Item)

# Now apply dichotomization
dich_data <- dichotomize(responses_letter, answer_key_list)
str(dich_data)
## 'data.frame':    1000 obs. of  30 variables:
##  $ Item1 : int  1 0 1 0 0 1 0 1 0 0 ...
##  $ Item2 : int  0 0 1 1 0 1 0 0 0 0 ...
##  $ Item3 : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Item4 : int  1 0 1 1 0 1 0 0 0 0 ...
##  $ Item5 : int  0 0 1 0 0 1 1 0 0 0 ...
##  $ Item6 : int  0 0 1 0 0 1 1 0 0 0 ...
##  $ Item7 : int  0 1 1 1 0 1 0 0 0 0 ...
##  $ Item8 : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Item9 : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Item10: int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Item11: int  0 0 1 1 0 1 0 0 0 0 ...
##  $ Item12: int  0 0 1 0 1 1 0 0 0 0 ...
##  $ Item13: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Item14: int  0 0 1 0 0 1 1 0 0 0 ...
##  $ Item15: int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Item16: int  0 0 1 0 0 1 0 1 0 0 ...
##  $ Item17: int  1 0 1 0 0 0 0 0 0 0 ...
##  $ Item18: int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Item19: int  0 0 1 1 0 1 1 0 1 0 ...
##  $ Item20: int  0 0 1 0 0 0 1 0 0 0 ...
##  $ Item21: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Item22: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Item23: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Item24: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Item25: int  0 1 1 0 1 1 1 0 0 0 ...
##  $ Item26: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Item27: int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Item28: int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Item29: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ Item30: int  0 1 1 0 0 1 0 0 0 0 ...

This block is a diagnostic routine to verify that your dichotomized (0/1) scoring worked correctly when comparing categorical responses against an answer key.

# Check all items and print mismatches/NA issues
for (item in names(answer_key_list)) {
  cat("\n=== Item:", item, "===\n")
  cat("Key:", paste(answer_key_list[[item]], collapse = ", "), "\n")
  
  # Cross-tabulate original vs. dichotomized
  tbl <- table(
    "Original" = responses_letter[[item]],
    "Dichotomized" = dich_data[[item]],
    useNA = "ifany"
  )
  print(tbl)
  
  # Check for all-NA dichotomization (potential key mismatch)
  if (all(is.na(dich_data[[item]]))) {
    warning("! All NA results for ", item, " — key may not match any responses!")
  }
  
  # Check if some correct responses are coded as 0 (key issue)
  correct_responses <- responses_letter[[item]] %in% answer_key_list[[item]]
  if (any(dich_data[[item]][correct_responses] != 1, na.rm = TRUE)) {
    warning("! Some correct responses in ", item, " were NOT coded as 1!")
  }
}
## 
## === Item: Item1 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 251   0
##        B  33   0
##        C 111   0
##        D 172   0
##        E   0 433
## 
## === Item: Item2 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 263   0
##        B 301   0
##        C  35   0
##        D 107   0
##        E   0 294
## 
## === Item: Item3 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 250   0
##        B 276   0
##        C 134   0
##        D 217   0
##        E   0 123
## 
## === Item: Item4 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 242   0
##        B 190   0
##        C  51   0
##        D 138   0
##        E   0 379
## 
## === Item: Item5 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 257   0
##        B 144   0
##        C  96   0
##        D 286   0
##        E   0 217
## 
## === Item: Item6 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 214   0
##        B 221   0
##        C 153   0
##        D  42   0
##        E   0 370
## 
## === Item: Item7 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 293   0
##        B 249   0
##        C 128   0
##        D 122   0
##        E   0 208
## 
## === Item: Item8 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 298   0
##        B 401   0
##        C  33   0
##        D 163   0
##        E   0 105
## 
## === Item: Item9 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 242   0
##        B 210   0
##        C 322   0
##        D  20   0
##        E   0 206
## 
## === Item: Item10 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 244   0
##        B 221   0
##        C 144   0
##        D 245   0
##        E   0 146
## 
## === Item: Item11 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 176   0
##        B 130   0
##        C  77   0
##        D 310   0
##        E   0 307
## 
## === Item: Item12 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 345   0
##        B  43   0
##        C 231   0
##        D  53   0
##        E   0 328
## 
## === Item: Item13 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 345   0
##        B 144   0
##        C  84   0
##        D 264   0
##        E   0 163
## 
## === Item: Item14 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 114   0
##        B 116   0
##        C 323   0
##        D 125   0
##        E   0 322
## 
## === Item: Item15 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 124   0
##        B 152   0
##        C 111   0
##        D 342   0
##        E   0 271
## 
## === Item: Item16 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A  57   0
##        B 595   0
##        C 107   0
##        D 121   0
##        E   0 120
## 
## === Item: Item17 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 334   0
##        B 285   0
##        D 139   0
##        E   0 242
## 
## === Item: Item18 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 470   0
##        B 160   0
##        C  59   0
##        D  19   0
##        E   0 292
## 
## === Item: Item19 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 147   0
##        B  95   0
##        C 163   0
##        D 117   0
##        E   0 478
## 
## === Item: Item20 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 135   0
##        B  73   0
##        C 310   0
##        D  27   0
##        E   0 455
## 
## === Item: Item21 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 278   0
##        B  73   0
##        C 359   0
##        D 226   0
##        E   0  64
## 
## === Item: Item22 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 380   0
##        B 105   0
##        C 212   0
##        D  38   0
##        E   0 265
## 
## === Item: Item23 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A  57   0
##        B  63   0
##        C 165   0
##        D 596   0
##        E   0 119
## 
## === Item: Item24 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 338   0
##        B 105   0
##        C 403   0
##        D 149   0
##        E   0   5
## 
## === Item: Item25 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 206   0
##        B 216   0
##        C  75   0
##        D  78   0
##        E   0 425
## 
## === Item: Item26 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 231   0
##        B 242   0
##        C 143   0
##        D   5   0
##        E   0 379
## 
## === Item: Item27 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 201   0
##        B  61   0
##        C  16   0
##        D 578   0
##        E   0 144
## 
## === Item: Item28 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 384   0
##        B   5   0
##        C  62   0
##        D 449   0
##        E   0 100
## 
## === Item: Item29 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 289   0
##        B 224   0
##        C  13   0
##        D  48   0
##        E   0 426
## 
## === Item: Item30 ===
## Key: E 
##         Dichotomized
## Original   0   1
##        A 267   0
##        B  59   0
##        C 265   0
##        D 215   0
##        E   0 194

2.4 Reliability Analysis for Dichotomous Data

2.4.1 Kuder-Richardson formula 20 (KR20)

The KR20 (Kuder-Richardson Formula 20) is a measure of internal consistency reliability for tests with binary (dichotomous) items (e.g., scored as 0 = incorrect, 1 = correct). It estimates how consistently all items in a test measure the same underlying construct.

\[ KR_{20} = \left[\frac{k}{k-1}\right] \times \left[\frac{\sigma_{X}^{2} - \sum\limits^{k}_{i=1} p_{i}(1-p_{i})}{\sigma^{2}_{X}}\right] \] where:

\(\quad k =\) number of test items

\(\quad p_i\) = proportion of examinees who answered item i correctly

\(\quad \sigma_{X}^{2}\) = variance of total test scores

Key Properties:

  • Equivalent to Cronbach's alpha, but specifically for binary data.
  • Ranges from 0 to 1, where:

Value Range   Reliability Level   Interpretation
> 0.9         Excellent           Suitable for high-stakes decisions
0.8–0.9       Good                Appropriate for most uses
0.7–0.8       Acceptable          May need improvement
< 0.7         Low                 Test requires revision

  • Assumes items are parallel (similar difficulty and discrimination).

When to Use

  • For multiple-choice tests (right/wrong scoring)
  • When data are strictly binary (KR-20 is the binary special case of Cronbach's alpha)
  • To check internal consistency of:
    • Exams
    • Surveys
    • Assessments

Limitations:

  • Requires all items to be binary (0/1)
  • Sensitive to test length (more items → higher reliability)
  • Does not account for guessing effects
  • Requires similar item variances for accurate estimation

R Implementation of Kuder-Richardson formula 20:

This code defines a function KR20() that computes the Kuder–Richardson 20 reliability coefficient for a matrix/data frame of dichotomous item responses (0/1).

#' Calculate Kuder-Richardson 20 (KR-20) Reliability Coefficient
#'
#' Computes the KR-20 reliability coefficient for dichotomous (0/1) test items,
#' which is equivalent to Cronbach's alpha for binary data. This is the preferred
#' reliability measure when items have varying difficulty levels.
#'
#' @param X A matrix or data frame of binary responses (0 = incorrect, 1 = correct)
#'           with rows representing respondents and columns representing items.
#'           Missing values are allowed and will be handled through listwise deletion.
#'
#' @return A list containing:
#' \itemize{
#'   \item \code{KR20} - Numeric value between 0-1 representing reliability
#'   \item \code{interpretation} - Qualitative reliability assessment
#'   \item \code{n_items} - Number of test items
#'   \item \code{mean_score} - Mean total test score
#'   \item \code{var_scores} - Variance of total scores
#'   \item \code{item_difficulty} - Vector of item difficulties (proportion correct)
#' }
#'
#' @details
#' The KR-20 formula is:
#' \deqn{KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^k p_i(1-p_i)}{\sigma_X^2}\right)}
#' where:
#' \itemize{
#'   \item \eqn{k} = number of items
#'   \item \eqn{p_i} = proportion correct for item i
#'   \item \eqn{\sigma_X^2} = variance of total scores
#' }
#'
#' @examples
#' # Simulate test data (100 respondents, 20 items)
#' set.seed(123)
#' responses <- matrix(rbinom(100*20, 1, 0.7), nrow = 100)
#' 
#' # Calculate reliability
#' result <- KR20(responses)
#' print(result$KR20)
#' cat(result$interpretation)
#'
#' @references
#' Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. 
#' Psychometrika, 2(3), 151-160.
#' 
#' @seealso \code{\link{KR21}} for the simplified reliability coefficient, 
#'          \code{\link[psych]{alpha}} for general reliability calculation
#' @export
KR20 <- function(X) {
  # Input validation and conversion
  if (!is.matrix(X) && !is.data.frame(X)) {
    stop("Input must be a matrix or data frame")
  }
  
  X <- as.matrix(X)
  
  bad <- !is.na(X) & !(X %in% c(0, 1))
  if (any(bad)) {
    stop("All items must be dichotomous (0/1) or NA")
  }
  
  
  # Listwise deletion of missing values
  complete_cases <- complete.cases(X)
  if (sum(!complete_cases) > 0) {
    warning(sprintf("Removed %d cases with missing values", sum(!complete_cases)))
    X <- X[complete_cases, , drop = FALSE]
  }
  
  if (nrow(X) < 2) {
    stop("At least 2 complete cases required")
  }
  
  if (ncol(X) < 2) {
    stop("At least 2 items required for reliability calculation")
  }
  
  # Calculate components
  k <- ncol(X)
  total_scores <- rowSums(X)
  SX <- var(total_scores)
  M <- mean(total_scores)
  IM <- colMeans(X, na.rm = TRUE)  # Item difficulties
  
  # Handle zero variance case
  # after computing SX
  if (is.na(SX) || SX <= .Machine$double.eps) {
    warning("Near-zero variance in total scores — reliability undefined")
    return(list(
      KR20 = NA_real_,
      interpretation = "Undefined (zero/near-zero variance in scores)",
      n_items = k,
      mean_score = M,
      var_scores = SX,
      item_difficulty = IM
    ))
  }
  
  
  # Compute KR-20
  item_variances <- sum(IM * (1 - IM))
  kr20 <- (k / (k - 1)) * (1 - item_variances / SX)
  
  # Interpretation guide
  reliability_level <- cut(kr20,
                           breaks = c(-Inf, 0.5, 0.6, 0.7, 0.8, 0.9, Inf),
                           labels = c("Unacceptable", "Poor", "Questionable", 
                                      "Acceptable", "Good", "Excellent"))
  
  # Return comprehensive results
  list(
    KR20 = kr20,
    interpretation = sprintf(
      "KR-20 = %.3f (%s reliability for educational measurements)",
      kr20, as.character(reliability_level)),
    n_items = k,
    mean_score = M,
    var_scores = SX,
    item_difficulty = IM
  )
}

# Save function to file
dump("KR20", file = "KR20.R")
# Calculate reliability
result <- KR20(dich_data)
print(result$KR20)
## [1] 0.91617
cat(result$interpretation)
## KR-20 = 0.916 (Excellent reliability for educational measurements)

The scale demonstrated strong internal consistency, with a KR-20 coefficient of 0.916, which falls within the excellent range for educational and psychological measurements.

2.4.2 Kuder-Richardson formula 21 (KR21)

The Kuder–Richardson Formula 21 (KR-21) is an index of internal consistency reliability for dichotomous items (e.g., right/wrong, true/false, 0/1). It is derived as a special case of KR-20 under the restrictive assumption that all items have equal difficulty and discrimination.

Because KR-21 only requires the number of items, the mean score, and the total variance, it is computationally simpler than KR-20. However, this simplicity comes at a cost: if item difficulties vary substantially, KR-21 will generally underestimate reliability.

For this reason, KR-21 is best viewed as a rough lower-bound estimate and should only be used when items are fairly homogeneous in difficulty. When item-level data are available, KR-20 (or Cronbach’s α, its generalization) is typically preferred.

\[ KR_{21} = \frac{k}{k - 1} \times \left( 1 - \frac{M(k - M)}{k \sigma_X^2} \right) \] where:

\(\quad k\) = number of items,

\(\quad M\) = mean total test score,

\(\quad \sigma_X^2\) = variance of total test scores.

Which Formula Should You Use?

Use KR-21 when:

  • Items are approximately equal in difficulty (e.g., a balanced true/false quiz).
  • You need a quick, rough estimate without item-level statistics.

Use KR-20 when:

  • Item difficulties vary meaningfully.
  • You can compute item proportions (\(p_i\)) or item variances.

Notes:

  • KR-21 assumes equal difficulty; when that assumption fails, it typically underestimates reliability relative to KR-20.
  • For dichotomous items, KR-20 is numerically equivalent to Cronbach’s \(\alpha\).
  • Either coefficient requires nonzero variance in total scores.


R Implementation of Kuder-Richardson formula 21:

#' Calculate Kuder-Richardson Formula 21 (KR-21) Reliability Coefficient
#'
#' Computes the KR-21 reliability coefficient for dichotomous (0/1) test items.
#' This simplified reliability estimate assumes all items have equal difficulty.
#'
#' @param X A matrix or data frame containing dichotomous item responses (0 = incorrect, 1 = correct).
#'          Rows represent respondents, columns represent test items.
#'
#' @return A list containing:
#' \itemize{
#'   \item \code{KR21} - The calculated reliability coefficient (numeric)
#'   \item \code{interpretation} - Qualitative reliability assessment (character)
#'   \item \code{n_items} - Number of test items (integer)
#'   \item \code{mean_score} - Mean total test score (numeric)
#'   \item \code{var_scores} - Variance of total scores (numeric)
#' }
#'
#' @details
#' The KR-21 formula is:
#' \deqn{KR_{21} = \frac{k}{k-1}\left(1 - \frac{M(k - M)}{k\sigma^2}\right)}
#' where:
#' \itemize{
#'   \item \eqn{k} = number of items
#'   \item \eqn{M} = mean total score
#'   \item \eqn{\sigma^2} = variance of total scores
#' }
#'
#' @note
#' For tests with varying item difficulty, consider using \code{\link{KR20}} or \code{\link{cronbach.alpha}} instead.
#' KR-21 may underestimate reliability when items have substantially different difficulty levels.
#'
#' @examples
#' # Generate sample dichotomous data
#' set.seed(123)
#' test_data <- matrix(rbinom(100, 1, 0.6), ncol = 10)
#' 
#' # Calculate KR-21
#' result <- KR21(test_data)
#' print(result$KR21)
#' cat(result$interpretation)
#'
#' @references
#' Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. 
#' Psychometrika, 2(3), 151-160.
#'
#' @seealso \code{\link{KR20}} for the more general reliability coefficient
#' @export
KR21 <- function(X) {
  # Input validation
  if (!is.matrix(X) && !is.data.frame(X)) {
    stop("Input must be a matrix or data frame")
  }
  
  X <- data.matrix(X)
  
  if (any(!X %in% c(0, 1))) {
    stop("All items must be dichotomous (0 or 1)")
  }
  
  if (ncol(X) < 2) {
    stop("At least 2 items required for reliability calculation")
  }
  
  # Calculate components
  n <- ncol(X)
  total_scores <- rowSums(X)
  M <- mean(total_scores)
  var_total <- var(total_scores)
  
  # Handle case with zero variance
  if (var_total == 0) {
    warning("Zero variance in total scores - reliability undefined")
    return(list(
      KR21 = NA_real_,
      interpretation = "Undefined (zero variance in scores)",
      n_items = n,
      mean_score = M,
      var_scores = 0
    ))
  }
  
  # Compute KR-21
  kr21 <- (n / (n - 1)) * (1 - (M * (n - M)) / (n * var_total))
  
  # Interpretation guide
  reliability_level <- cut(kr21,
                           breaks = c(-Inf, 0.5, 0.6, 0.7, 0.8, 0.9, Inf),
                           labels = c("Unacceptable", "Poor", "Questionable", 
                                      "Acceptable", "Good", "Excellent"))
  
  # Return comprehensive results
  list(
    KR21 = kr21,
    interpretation = sprintf(
      "KR-21 = %.3f (%s reliability for educational measurements)",
      kr21, as.character(reliability_level)),
    n_items = n,
    mean_score = M,
    var_scores = var_total
  )
}

dump("KR21", file = "KR21.R")
result <- KR21(dich_data)
print(result$KR21)
## [1] 0.90519
cat(result$interpretation)
## KR-21 = 0.905 (Excellent reliability for educational measurements)

The KR-21 estimate (0.905) was slightly lower than KR-20, as expected when item difficulties vary, but it still falls within the excellent range for educational measurements.


2.4.3 Split-Half Reliability

Split-half reliability is a method for assessing the internal consistency of a test by dividing it into two equivalent halves and computing the correlation between them. The resulting correlation is then adjusted using the Spearman-Brown prophecy formula to estimate the reliability of the full-length test.

Function: Split items (columns) into two approximately equal-sized random sets

#' Randomly Split Items (Columns) into Two Groups
#'
#' This function randomly divides the columns (items) of a response matrix or 
#' data frame into two subsets, as evenly as possible. 
#' Useful for split-half reliability calculations and related psychometric checks.
#'
#' @param X A numeric matrix or data frame of item responses (rows = persons, 
#'   columns = items).
#' @param seed Optional integer. If provided, sets a random seed for 
#'   reproducibility of the column split.
#'
#' @details
#' - If the number of items is even, both halves will have equal size.  
#' - If the number of items is odd, the first half (`Y1`) will contain one 
#'   additional item compared to the second half (`Y2`).  
#'
#' @return A list of two matrices:
#' \describe{
#'   \item{Y1}{Matrix of responses for the first half of items.}
#'   \item{Y2}{Matrix of responses for the second half of items.}
#' }
#'
#' @examples
#' set.seed(123)
#' X <- matrix(sample(1:5, 50, replace = TRUE), nrow = 10, ncol = 5)
#' split_result <- SPLIT.ITEMS(X, seed = 42)
#' str(split_result)
#'
#' @export
SPLIT.ITEMS <- function(X, seed = NULL) {
  # optional fixed seed
  if (!is.null(seed)) {set.seed(seed)} 
  
  X <- as.matrix(X)
  
  # number of items (columns)
  k <- ncol(X)
  
  # randomly choose ~half of the items
  index <- sample(seq_len(k), ceiling(k/2))
  
  Y1 <- X[, index, drop = FALSE]
  Y2 <- X[, -index, drop = FALSE]
  
  return(list(Y1, Y2))
}

dump("SPLIT.ITEMS", file = "SPLIT.ITEMS.R")

2.4.4 Spearman-Brown formula

The Spearman-Brown (Brown, 1910; Spearman, 1910) formula is used to predict the reliability of a test after changing the test length.

\[ \rho_{XX'} = \frac{N \cdot \rho_{YY'}}{1+(N-1) \cdot \rho_{YY'}} \]

where

\(\qquad N\) is the factor by which the length of the test is changed,

\(\qquad \rho_{XX'}\) is the predicted reliability coefficient, and

\(\qquad \rho_{YY'}\) is reliability of the original test

# SAFELY LOAD PACKAGES
load_packages <- function(packages) {
  # Loop through each package
  for (pkg in packages) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)  # Load the package
  }
}

required_packages <- c("psych", "kableExtra", "dplyr", "ggplot2")
load_packages(required_packages)

# SPEARMAN-BROWN FUNCTION
spearman_brown <- function(alpha, N) {
  # N = 0.5 (halve), 2 (double), 3 (triple)
  (N * alpha) / (1 + (N - 1) * alpha)
}

# Calculate alpha
alpha_result <- psych::alpha(dich_data[, 1:30])$total$raw_alpha

# Create predictions
predictions <- data.frame(
  Scenario = c("Current", "Halved", "Doubled", "Tripled"),
  Items = c(30, 15, 60, 90),
  Alpha = round(c(
    alpha_result,
    spearman_brown(alpha_result, 15/30),
    spearman_brown(alpha_result, 2),
    spearman_brown(alpha_result, 3)
  ), 3)
)

# Create table
kableExtra::kbl(
  predictions,
  caption = "Reliability Projections",
  col.names = c("Scenario", "Items", "Cronbach's α")
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "center"
  ) %>%
  kableExtra::column_spec(3, bold = TRUE)
Reliability Projections
Scenario Items Cronbach's α
Current 30 0.916
Halved 15 0.845
Doubled 60 0.956
Tripled 90 0.970

R Implementation of Split-Half Reliability with Spearman-Brown Adjustment

#' Calculate Split-Half Reliability with Spearman-Brown Adjustment
#'
#' Estimates test reliability by splitting items into two halves and computing the
#' correlation between halves, then adjusting for test length using the
#' Spearman-Brown prophecy formula.
#'
#' @param X A matrix or data frame of test responses (rows = respondents, columns = items)
#' @param method Splitting method: "odd_even", "first_last", or "random" (default)
#' @param n_splits Number of random splits to average (for method = "random")
#' @param spearman_brown Logical indicating whether to apply Spearman-Brown correction (default = TRUE)
#'
#' @return A list containing:
#' \itemize{
#'   \item \code{reliability} - Estimated reliability coefficient
#'   \item \code{split_correlation} - Raw correlation between halves
#'   \item \code{method} - Splitting method used
#'   \item \code{spearman_brown} - Whether correction was applied
#'   \item \code{half1_scores} - Scores for first half
#'   \item \code{half2_scores} - Scores for second half
#' }
#'
#' @details
#' The Spearman-Brown prophecy formula adjusts the split-half correlation to estimate
#' the reliability of the full-length test:
#' \deqn{r_{xx} = \frac{2 \cdot r_{12}}{1 + r_{12}}}
#' where \eqn{r_{12}} is the correlation between the two halves.
#'
#' @examples
#' # Simulate test data (30 items, 100 respondents)
#' set.seed(123)
#' likert_numeric <- matrix(rbinom(100*30, 1, 0.6), nrow = 100)
#'
#' # Calculate reliability
#' result <- split_half_reliability(likert_numeric, method = "odd_even")
#' print(result$reliability)
#'
#' @references
#' Spearman, C. (1910). Correlation calculated from faulty data. 
#' British Journal of Psychology, 3(3), 271-295.
#'
#' Brown, W. (1910). Some experimental results in the correlation of mental abilities.
#' British Journal of Psychology, 3(3), 296-322.
#'
#' @seealso \code{\link{KR20}}, \code{\link{KR21}} for internal consistency measures
#' @export
split_half_reliability <- function(X, method = "random", n_splits = 20, spearman_brown = TRUE) {
  # Input validation
  if (!is.matrix(X) && !is.data.frame(X)) {
    stop("Input must be a matrix or data frame")
  }
  
  X <- data.matrix(X)
  n_items <- ncol(X)
  
  if (n_items < 4) {
    stop("At least 4 items required for split-half reliability")
  }
  
  if (nrow(X) < 3) {
    stop("At least 3 respondents required")
  }
  
  # Split the test according to specified method
  if (method == "odd_even") {
    half1 <- seq(1, n_items, by = 2)
    half2 <- seq(2, n_items, by = 2)
  } else if (method == "first_last") {
    half1 <- 1:floor(n_items/2)
    half2 <- (floor(n_items/2)+1):n_items
  } else if (method == "random") {
    # Average multiple random splits for stability
    split_cors <- vector("numeric", n_splits)
    
    for (i in seq_len(n_splits)) {
      split <- sample(1:n_items, size = floor(n_items/2))
      half1 <- split
      half2 <- setdiff(1:n_items, split)
      
      scores1 <- rowSums(X[, half1], na.rm = TRUE)
      scores2 <- rowSums(X[, half2], na.rm = TRUE)
      
      split_cors[i] <- cor(scores1, scores2, use = "pairwise.complete.obs")
    }
    
    r12 <- mean(split_cors, na.rm = TRUE)
    half1 <- half2 <- NULL  # No specific halves for random splits
  } else {
    stop("Invalid method: choose 'odd_even', 'first_last', or 'random'")
  }
  
  # For non-random methods
  if (method != "random") {
    scores1 <- rowSums(X[, half1], na.rm = TRUE)
    scores2 <- rowSums(X[, half2], na.rm = TRUE)
    r12 <- cor(scores1, scores2, use = "pairwise.complete.obs")
  }
  
  # Apply Spearman-Brown correction
  if (spearman_brown) {
    reliability <- (2 * r12) / (1 + r12)
  } else {
    reliability <- r12
  }
  
  # Return results
  structure(
    list(
      reliability = reliability,
      split_correlation = r12,
      method = method,
      spearman_brown = spearman_brown,
      half1_scores = if (method != "random") scores1 else NULL,
      half2_scores = if (method != "random") scores2 else NULL,
      n_splits = if (method == "random") n_splits else NULL
    ),
    class = "split_half_reliability"
  )
}

#' Print Method for Split-Half Reliability Object
#'
#' @param x split_half_reliability object
#' @param ... Additional arguments passed to print
#'
#' @export
print.split_half_reliability <- function(x, ...) {
  cat("\nSplit-Half Reliability Analysis\n")
  cat("Method:", x$method, "\n")
  if (x$method == "random") cat("Number of splits:", x$n_splits, "\n")
  cat("Split Correlation:", round(x$split_correlation, 3), "\n")
  if (x$spearman_brown) {
    cat("Spearman-Brown Adjusted Reliability:", round(x$reliability, 3), "\n")
  }
  invisible(x)
}
# Basic usage with odd-even split
result <- split_half_reliability(dich_data, method = "odd_even")
print(result)
## 
## Split-Half Reliability Analysis
## Method: odd_even 
## Split Correlation: 0.858 
## Spearman-Brown Adjusted Reliability: 0.923
# With random splits (more stable estimate)
result_random <- split_half_reliability(dich_data, method = "random", n_splits = 100)
print(result_random)
## 
## Split-Half Reliability Analysis
## Method: random 
## Number of splits: 100 
## Split Correlation: 0.849 
## Spearman-Brown Adjusted Reliability: 0.919
# Without Spearman-Brown correction
result_raw <- split_half_reliability(dich_data, spearman_brown = FALSE)

These results assess the internal consistency of a test by splitting it into two halves and computing the correlation between them. The Spearman-Brown prophecy formula adjusts this correlation to estimate the reliability of the full-length test.

Split-half reliability analyses were conducted using both odd-even and random splits. The raw half-test correlations were 0.858 (odd-even) and 0.849 (averaged over 100 random splits); after Spearman-Brown adjustment, the full-test reliability estimates were 0.923 and 0.919, respectively, supporting the scale’s strong internal consistency.

1. Odd-Even Split Results:

Method: Items were split into odd-numbered vs. even-numbered halves.

  • Split Correlation (r₁₂): 0.858
    • This is the raw correlation between the two halves.
    • Indicates good consistency between halves.
  • Spearman-Brown Adjusted Reliability: 0.923
    • Estimates the reliability of the full test (not just the halves).
    • A value of 0.923 is considered excellent in most psychometric contexts (generally, ≥ 0.7 is acceptable, ≥ 0.8 is good).

What does this mean?

  • The test shows excellent internal consistency.
  • The raw correlation (0.858) describes agreement between the two half-tests; the Spearman-Brown adjustment projects it to the full 30-item length, yielding the 0.923 estimate.

2. Random Split Results (100 Splits)

  • Method: Items were randomly split 100 times, and results were averaged.
  • Split Correlation (r₁₂): 0.849
    • Slightly lower than the odd-even split, suggesting some variability in reliability depending on how the test is divided.
  • Spearman-Brown Adjusted Reliability: 0.919
    • Still in the excellent range (≥ 0.9), but slightly lower than the odd-even estimate.

Why is the random split result different?

  • The odd-even method may artificially inflate reliability if items are ordered by difficulty or topic.
  • The random method averages many splits, giving a more stable and realistic estimate.
  • The difference (0.858 vs. 0.849) suggests that the odd-even method may be slightly overestimating reliability.
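
To see this variability directly, the sketch below (assuming `dich_data` and the `split_half_reliability()` function defined above) repeats single random splits and summarizes the spread of the adjusted estimates:

# Distribution of single random-split estimates (sketch)
set.seed(99)
single_splits <- replicate(
  200,
  split_half_reliability(dich_data, method = "random", n_splits = 1)$reliability
)
summary(single_splits)   # the spread shows why averaging many splits is more stable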

2.4.5 Guttman's Lambda

  • L1: An intermediate coefficient used in computing the other lambdas.
  • L2: More complex than Cronbach's alpha and preferred by some researchers, though less common.
  • L3: Equivalent to Cronbach's alpha.
  • L4: Guttman split-half reliability.
  • L5: Recommended when a single item highly covaries with other items, which themselves lack high covariances with each other.
  • L6: Recommended when inter-item correlations are low in relation to squared multiple correlations.

Guttman's \(\lambda_3\) is the same as Cronbach's \(\alpha\) (Cronbach, 1951) and may be computed by

\[ \lambda_3 = \alpha = \frac{K}{K-1} \times \left( 1-\frac{\sum_{i=1}^{K} \sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right) \]

If the items are scored 0 and 1, a shortcut formula is

\[ \lambda_3 = \alpha = \frac{k}{k-1} \times \left( 1-\frac{\sum_{i=1}^{k} P_i \cdot Q_i}{\sigma^{2}_{X}}\right) \]

where

\(\qquad \sigma^{2}_{X}\) is the variance of the observed total test scores, and

\(\qquad \sigma^{2}_{Y_i}\) is the variance of component i for the current sample of persons.

\(\qquad P_i\) is the proportion scoring 1 on item i, and

\(\qquad Q_i = 1-P_i\).


Function: Computes Guttman's \(\lambda_3\)

#' Compute Guttman's Lambda-3 (Cronbach's Alpha)
#'
#' Estimates Guttman's Lambda-3 reliability coefficient (equivalent to Cronbach's Alpha)
#' from a matrix or data frame of numeric item responses, typically Likert-style data.
#'
#' @param likert_numeric A data frame or matrix of numeric responses (rows = respondents, columns = items).
#'
#' @return A numeric value representing Guttman's Lambda-3 (Cronbach's alpha).
#'
#' @details
#' The function computes total score variance and item-level variances to estimate
#' the reliability of a scale. This implementation assumes that higher total score
#' variance relative to item variance implies greater internal consistency.
#'
#' \deqn{\alpha = \frac{k}{k - 1} \left(1 - \frac{\sum \sigma^2_i}{\sigma^2_X} \right)}
#' where:
#' \itemize{
#'   \item \( k \): number of items
#'   \item \( \sigma^2_i \): variance of each item
#'   \item \( \sigma^2_X \): variance of the total scores
#' }
#'
#' @references
#' Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
#' Psychometrika, 16(3), 297–334.
#'
#' Guttman, L. (1945). A basis for analyzing test-retest reliability.
#' Psychometrika, 10(4), 255–282.
#'
#' @seealso
#' \code{\link[psych]{alpha}} for more advanced reliability analysis.
#'
#' @examples
#' \dontrun{
#' # Simulated Likert data
#' likert_data <- matrix(sample(1:5, 1000, replace = TRUE), ncol = 10)
#' G3(likert_data)
#' }
#'
#' @export
G3 <- function(likert_numeric) {
  
  X <- as.matrix(likert_numeric)
  k <- ncol(X) # number of items
  
  SX <- var(rowSums(X))      # variance of total scores
  SI <- apply(X, 2, var)     # variance of each item
  L1 <- 1 - (sum(SI) / SX)   # Guttman's Lambda-1
  
  return((k * L1) / (k - 1)) # Equivalent to Cronbach's Alpha
}

# save to file
dump("G3", file = "G3.R")
# compute G3
G3(dich_data)
## [1] 0.91605
# compare
require(psych, warn.conflicts = FALSE, quietly = TRUE)
guttman(dich_data)
## Call: guttman(r = dich_data)
## 
## Alternative estimates of reliability
## 
## Guttman bounds 
## L1 =  0.88 
## L2 =  0.92 
## L3 (alpha) =  0.92 
## L4 (max) =  0.93 
## L5 =  0.9 
## L6 (smc) =  0.92 
## TenBerge bounds 
## mu0 =  0.92 mu1 =  0.92 mu2 =  0.92 mu3 =  0.92 
## 
## alpha of first PC =  0.92 
## estimated greatest lower bound based upon communalities=  0.93 
## 
## beta found by splitHalf  =  0.89

2.4.6 Pearson product-moment correlation coefficient

\[ \begin{aligned} \rho_{YY'} &= cor(Y, Y') \\ \\ &= \frac{cov(Y, Y')}{\sigma_Y \cdot \sigma_{Y'}} \\ \\ &= \frac{E \Big[(Y-\mu_Y) \times (Y'-\mu_{Y'})\Big]}{\sigma_Y \times \sigma_{Y'}} \end{aligned} \]

If we have a series of K measurements of Y and Y' written as \(y_i\) and \(y'_i\), where i = 1, 2, ..., K, then the sample correlation coefficient can be used to estimate the population Pearson correlation \(\rho_{YY'}\) between Y and Y'.

\[ \begin{aligned} r_{YY'} &= \frac{\sum\limits^{K}_{i=1}(y_i-\bar{y}) \times (y'_i-\bar{y}')}{(K-1) \times S_Y \times S_{Y'}} \\ \\ &= \frac{\sum\limits^{K}_{i=1}\Big[(y_i-\bar{y}) \times (y'_i-\bar{y}')\Big]}{\sqrt{\sum\limits^{K}_{i=1}(y_i-\bar{y})^2 \times \sum\limits^{K}_{i=1}(y'_i-\bar{y}')^2}} \end{aligned} \]


Function: Computes a bootstrapped (default 1,000 replicates) Pearson product-moment correlation coefficient between total scores on two random halves of the items.

#' Estimate Split-Half Reliability Using Pearson Correlation
#'
#' Estimates split-half reliability via the Pearson product-moment correlation
#' between total scores of two randomly split halves of the test. Repeats the
#' process over multiple bootstrap replicates and returns the mean correlation.
#'
#' @param X A matrix or data frame of numeric item responses (rows = examinees, columns = items).
#' @param seed Optional integer to set the random seed for reproducibility.
#' @param n Integer specifying the number of bootstrap replicates (default = 1000).
#'
#' @return A numeric value representing the average Pearson split-half reliability across \code{n} random splits.
#'
#' @details
#' For each bootstrap iteration, items are randomly divided into two halves
#' using the external function \code{SPLIT.ITEMS()}. The total scores for each half
#' are computed, centered (residualized), and correlated using the Pearson formula.
#' The mean of these correlations across replicates is returned as an estimate
#' of split-half reliability.
#'
#' This method assumes unidimensionality and equal-length halves in expectation.
#'
#' @references
#' Eisinga, R., Grotenhuis, M. T., & Pelzer, B. (2013). The reliability of a two-item scale:
#' Pearson, Cronbach, or Spearman-Brown? \emph{International Journal of Public Health}, 58, 637–642.
#'
#' @seealso
#' \code{\link{SPLIT.ITEMS}} for item-splitting procedure.  
#' \code{\link{cor}}, \code{\link[psych]{splitHalf}} for alternative implementations.
#'
#' @examples
#' \dontrun{
#' # Simulated data
#' X <- matrix(sample(1:5, 1000, replace = TRUE), ncol = 10)
#' pearson(X, seed = 123, n = 500)
#' }
#'
#' @export
pearson <- 
  function(X, seed = NULL, n = NULL){
    source("SPLIT.ITEMS.R")  # Assumes SPLIT.ITEMS is available
    
    # optional fixed seed
    if (!is.null(seed)) {set.seed(seed)}
    
    # default number of replicates
    if (is.null(n)) {n <- 1e3}   
    
    X <- as.matrix(X)
    r <- rep(NA, n)
    
    for (i in 1:n) {
      # split items
      Y <- SPLIT.ITEMS(X)
      
      # total scores for each half
      S1 <- as.matrix(rowSums(Y[[1]]))
      S2 <- as.matrix(rowSums(Y[[2]]))
      
      # residual scores
      R1 <- S1 - mean(S1)
      R2 <- S2 - mean(S2)
      
      # Pearson product-moment correlation coefficient
      r[i] <- (t(R1) %*% R2) / (sqrt((t(R1) %*% R1)) * sqrt((t(R2) %*% R2)))
    }
    
    return(mean(r))
  }


dump("pearson", file = "pearson.R")
# compute the Pearson product-moment correlation coefficient
pearson(dich_data, seed = 456, n = 1)
## [1] 0.83673
# compare
# split items
set.seed(456)
Y <- SPLIT.ITEMS(dich_data)

# total scores
S1 <- as.matrix(rowSums(Y[[1]]))
S2 <- as.matrix(rowSums(Y[[2]]))

cor(S1, S2)
##         [,1]
## [1,] 0.83673

When scores are not essentially tau-equivalent (for example, when the items are not homogeneous but instead increase in difficulty), the KR-20 provides a lower bound for internal consistency (reliability).

KR-21 typically underestimates the reliability of a test compared with KR-20 when item difficulties are unequal; the two formulas agree only when all item difficulties are equal.

Cronbach's \(\pmb{\alpha}\) and the Kuder-Richardson coefficients (Kuder & Richardson, 1937) provide a lower bound for a test's reliability. This lower bound equals the test reliability if the split halves are essentially \(\pmb{\tau}\)-equivalent. These coefficients should be used only for homogeneous tests, since they reflect item homogeneity; otherwise they will be inappropriately low.

A high KR-20 coefficient (e.g., > 0.90) provides evidence for the assumption of a homogeneous or unidimensional test. However, it is possible to obtain a high KR-20 with a multidimensional scale, especially when the number of test items is large.
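
As a quick numerical check on these relationships, here is a minimal sketch, assuming `dich_data` is the 0/1 response matrix used throughout, that computes both coefficients from their defining formulas; with unequal item difficulties, KR-21 should come out at or below KR-20:

kr20_kr21 <- function(X) {
  X <- as.matrix(X)
  k   <- ncol(X)         # number of items
  tot <- rowSums(X)      # observed total scores
  sx2 <- var(tot)        # total-score variance
  p   <- colMeans(X)     # item difficulties (P_i)
  kr20 <- (k / (k - 1)) * (1 - sum(p * (1 - p)) / sx2)
  kr21 <- (k / (k - 1)) * (1 - mean(tot) * (k - mean(tot)) / (k * sx2))
  c(KR20 = kr20, KR21 = kr21)
}
kr20_kr21(dich_data)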

The Spearman-Brown coefficient can over- or underestimate the test reliability if the split halves are not parallel tests. When the halves are parallel, the Spearman-Brown formula is useful for examining the effect of test length on reliability.

A test can produce different reliability estimates when administered to different samples of examinees. Generalizability Theory examines such systematic effects on reliability.


2.5 Standard Error of Measurement

The standard error of measurement, using Cronbach's alpha as the reliability estimate, can be computed as

\[ S_{E} = S_{X} \times \sqrt{1-r_{XX'}} \]

#' Calculate Standard Error of Measurement (SEM)
#'
#' Computes the Standard Error of Measurement for a test based on
#' the standard deviation of total scores and the reliability
#' coefficient estimated by Cronbach's alpha.
#'
#' @param X A numeric matrix or data frame of item responses (rows = examinees, columns = items).
#'
#' @return A numeric value representing the SEM.
#'
#' @details
#' SEM is computed using the formula:
#' \deqn{
#' \mathrm{SEM} = \sigma_X \sqrt{1 - \alpha}
#' }
#' where \(\sigma_X\) is the standard deviation of total test scores and
#' \(\alpha\) is Cronbach's alpha, an estimate of reliability.
#'
#' This function requires the external function \code{cronbachs_alpha()}
#' to be sourced from \code{"cronbachs_alpha.R"}.
#'
#' @references
#' Crocker, L., & Algina, J. (1986). \emph{Introduction to classical and modern test theory}.
#' Holt, Rinehart and Winston.
#'
#' @seealso
#' \code{\link{cronbachs_alpha}} for computing reliability.
#'
#' @examples
#' \dontrun{
#' X <- matrix(sample(1:5, 500, replace = TRUE), ncol = 10)
#' SEM(X)
#' }
#'
#' @export
SEM <- function(X) {
  source("cronbachs_alpha.R")
  X <- data.matrix(X)
  
  return(sd(rowSums(X)) * sqrt(1 - cronbachs_alpha(X)))
}
SEM(dich_data)
## [1] 1.9505

2.6 Confidence Intervals for True Scores

A confidence interval for an examinee's true score can be constructed as

\[ X_i - z_c \cdot S_E \leq \tau \leq X_i + z_c \cdot S_E \]

For example, the critical value for a 90% confidence interval is 1.65, so for an examinee with an observed score of 19 and a standard error of measurement of 2.11, the confidence interval is:

\[ \begin{aligned} 19-1.65*2.11 &\leq \tau \leq 19+1.65*2.11 \\ 15.52 &\leq \tau \leq 22.48 \end{aligned} \]

# -----------------------------------
# COMPLETE KR-20 RELIABILITY ANALYSIS
# WITH CONFIDENCE INTERVALS
# -----------------------------------

# Load required package
if (!require("psych")) {
  install.packages("psych")
}
library(psych)

# Calculate KR-20 reliability
kr20_result <- psych::alpha(dich_data, check.keys = FALSE)
kr20 <- kr20_result$total$raw_alpha
cat("KR-20 Reliability Coefficient:", round(kr20, 3), "\n")
## KR-20 Reliability Coefficient: 0.916
# Calculate confidence intervals for true scores
if (exists("dich_data")) {
  # Calculate total scores
  total_scores <- rowSums(dich_data, na.rm = TRUE)
  
  # Calculate Standard Error of Measurement (SEM)
  sem <- sd(total_scores, na.rm = TRUE) * sqrt(1 - kr20)
  
  # Create confidence interval data frame
  ci_data <- data.frame(
    ID = 1:length(total_scores),
    Observed_Score = total_scores,
    CI_Lower = round(total_scores - 1.96 * sem, 1),  # 95% CI
    CI_Upper = round(total_scores + 1.96 * sem, 1)
  )
  
  # Print first 10 cases
  cat("\nScore Confidence Intervals (first 10 cases):\n")
  print(head(ci_data, 10))
}
## 
## Score Confidence Intervals (first 10 cases):
##    ID Observed_Score CI_Lower CI_Upper
## 1   1              3     -0.8      6.8
## 2   2              4      0.2      7.8
## 3   3             23     19.2     26.8
## 4   4              7      3.2     10.8
## 5   5              3     -0.8      6.8
## 6   6             23     19.2     26.8
## 7   7              8      4.2     11.8
## 8   8              2     -1.8      5.8
## 9   9              1     -2.8      4.8
## 10 10              1     -2.8      4.8

2.7 Unidimensionality of the Latent Trait

Core Concept: Test items demonstrate unidimensionality when they measure:

  • A single dominant latent trait (primary factor loading > 0.6)
  • Minimal influence from secondary factors (cross-loadings < 0.3)

Importance in Measurement

| Aspect | Unidimensional Case | Multidimensional Case |
|---|---|---|
| Interpretability | Clear construct interpretation | Ambiguous trait attribution |
| Reliability | High internal consistency (α/ω > 0.8) | Inflated reliability estimates |
| Validity | Strong evidence for construct validity | Confounded trait measurement |

Statistical Evaluation Methods:

  • Factor Analysis Criteria
    • Primary factor loadings: All items > 0.6 on target factor
    • Cross-loadings: All items < 0.3 on secondary factors
    • Variance explained: First factor > 40% of total variance
  • Kaiser's Eigenvalue Rule (Critique)
    • Original rule: Retain factors with eigenvalues ≥ 1.0
    • Limitations:
      • Overestimates factors in large datasets
      • Underestimates factors with correlated items
      • Should never be used as the sole criterion (a quick eigenvalue check is sketched below)
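
For reference, a minimal sketch of the rule just critiqued, assuming the numeric response matrix `responses_numeric` analyzed later in this section:

ev <- eigen(cor(responses_numeric, use = "pairwise.complete.obs"))$values
sum(ev >= 1)  # number of factors the Kaiser rule would retain; use with caution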

Kaiser-Meyer-Olkin (KMO) Measure: Quantifies sampling adequacy for factor analysis

Interpretation Guidelines

| KMO Range | Kaiser's Classification | Recommended Action |
|---|---|---|
| 0.90 - 1.00 | Marvelous | Ideal for factor analysis |
| 0.80 - 0.89 | Meritorious | Very suitable for analysis |
| 0.70 - 0.79 | Middling | Acceptable but interpret with caution |
| 0.60 - 0.69 | Mediocre | Marginal adequacy - consider item revision |
| 0.50 - 0.59 | Miserable | Unacceptable without modifications |
| < 0.50 | Unacceptable | Completely unsuitable for FA |
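
The KMO statistic can be computed with the `psych` package; a minimal sketch, again assuming `responses_numeric`:

kmo <- psych::KMO(responses_numeric)
kmo$MSA             # overall measure of sampling adequacy
round(kmo$MSAi, 2)  # sampling adequacy for each item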

2.7.1 Horn's Parallel Analysis (PA)

Horn–Glorfeld Parallel Analysis (Glorfeld, 1995; Horn, 1965) is a method used to determine the number of factors or components to retain in exploratory factor analysis or principal component analysis. It works by comparing the observed eigenvalues—which represent the variances explained by each principal component—with those obtained from randomly generated datasets via Monte Carlo simulation.

A component is retained if its observed eigenvalue is greater than the 95th percentile of the distribution of eigenvalues derived from the simulated random data. Since the first principal component explains the most variance, it will have the highest eigenvalue, and each successive component will explain progressively less.

Scree Plot and Cattell's Scree Test

A Scree Plot visualizes the eigenvalues of components on the Y-axis against the number of components on the X-axis. As one moves rightward along the X-axis (increasing number of components), eigenvalues typically drop steeply and then begin to level off, forming an "elbow" in the curve.

According to Cattell’s Scree Test (Cattell, 1966), the optimal number of components to retain corresponds to the point just before the curve flattens—that is, the components to the left of the elbow are considered meaningful, while those to the right likely represent noise.

Core Methodology

  • Monte Carlo Simulation: generate 1,000 or more random datasets that match the original data in terms of:
    • Number of variables
    • Sample size

For each simulated dataset, compute eigenvalues to build a distribution of expected values under the null hypothesis (i.e., no underlying structure).

Decision Rule:

  • Retain components where:
    • Observed eigenvalue > 95th percentile of simulated eigenvalues
    • This criterion controls the Type I error rate at α = .05, ensuring that retained components reflect meaningful structure rather than random variation.
#' Parallel Analysis for Dimensionality Assessment
#'
#' Performs parallel analysis using both PCA and factor analysis (PAF) approaches
#' using the `psych` package.
#' @param responses_numeric A matrix or data frame of test responses
#' @param iterations Number of Monte Carlo simulations (default: 1000)
#' @param seed Random seed for reproducibility (default: 1804)
#' @return A list containing results from fa.parallel (including plots)
#' @examples 
#' data(bfi, package = "psych")
#' results <- parallel_analysis(bfi[, 1:30])

parallel_analysis <- function(responses_numeric, 
                              iterations = 1000, 
                              seed = 1804) {
  
  # Input validation
  if (!is.matrix(responses_numeric) && !is.data.frame(responses_numeric)) {
    stop("Input must be a matrix or data.frame")
  }
  
  # Convert to data frame and drop NAs
  responses_numeric <- as.data.frame(responses_numeric)
  responses_numeric <- na.omit(responses_numeric)
  
  # Ensure all columns are numeric
  if (!all(sapply(responses_numeric, is.numeric))) {
    stop("All columns must be numeric. Please remove or convert non-numeric variables.")
  }
  
  # Ensure enough observations
  if (nrow(responses_numeric) <= ncol(responses_numeric)) {
    stop("Number of rows must exceed number of columns for stable factor extraction.")
  }
  
  # Load or install 'psych' package
  if (!requireNamespace("psych", quietly = TRUE)) {
    install.packages("psych", quiet = TRUE)
  }
  library(psych)
  
  # Set random seed
  set.seed(seed)
  
  # Perform parallel analysis using both PCA and FA
  result <- psych::fa.parallel(
    responses_numeric,
    fa = "both",             # Run both PCA and PAF
    n.iter = iterations,
    error.bars = TRUE,
    main = "Parallel Analysis: PCA vs PAF"
  )
  
  # Display summary
  cat("\n=== Parallel Analysis Results ===\n")
  cat(sprintf("Data: %d cases, %d items\n", nrow(responses_numeric), ncol(responses_numeric)))
  cat(sprintf("Iterations: %d\n", iterations))
  cat(sprintf("PCA suggests retaining %d components\n", result$ncomp))
  cat(sprintf("PAF suggests retaining %d factors\n", result$nfact))
  
  if (result$ncomp == result$nfact) {
    cat(sprintf("\nConsensus recommendation: Retain %d dimension(s)\n", result$ncomp))
  } else {
    cat("\nNote: PCA and PAF recommend different numbers of dimensions\n")
    cat(" - PCA Suggests:", result$ncomp, "\n")
    cat(" - PAF Suggests:", result$nfact, "\n")
  }
  
  return(invisible(result))
}
parallel_analysis(responses_numeric)

## Parallel analysis suggests that the number of factors =  1  and the number of components =  1 
## 
## === Parallel Analysis Results ===
## Data: 1000 cases, 30 items
## Iterations: 1000
## PCA suggests retaining 1 components
## PAF suggests retaining 1 factors
## 
## Consensus recommendation: Retain 1 dimension(s)

Interpretation Guide for Parallel Analysis Plot

Key Elements of the Plot:

  • Blue Line (Observed Data): Shows the actual eigenvalues from your dataset
  • Red Line (Simulated Data): Shows the average eigenvalues from random datasets with the same dimensions
  • Dashed Line (95th Percentile): Shows the 95th percentile of random eigenvalues

Determining the Number of Components/Factors

  • Where the lines cross:
    • The recommended number of components/factors is where your observed data (blue line) crosses below the simulated data (red line) or its 95th percentile (dashed line)
  • PCA vs. PAF results:
    • The function provides separate recommendations for PCA (components) and PAF (factors)
    • PCA typically suggests slightly more dimensions than PAF
  • Consensus recommendation:
    • When PCA and PAF agree, you can be more confident in the recommendation
    • When they disagree, consider:
      • The theoretical meaningfulness of the additional components
      • The purpose of your analysis (exploration vs. data reduction)
      • The interpretability of the additional dimensions

Additional Considerations:

  • Steep drops in the observed eigenvalues indicate important dimensions
  • Gradual declines suggest less important dimensions
  • Eigenvalues > 1 (Kaiser criterion) may be shown as a reference, but parallel analysis is generally more accurate

Decision Making:

  • When to retain more dimensions:
    • If you have strong theoretical reasons
    • If the additional dimensions are interpretable
    • If you need maximum variance accounted for
  • When to retain fewer dimensions:
    • For more parsimonious solutions
    • When additional dimensions are hard to interpret
    • When you suspect overfitting

Remember that parallel analysis provides a data-driven recommendation, but should be combined with substantive knowledge about your domain.


2.8 Item Analysis

Item analysis is a set of statistical techniques used to evaluate the quality and performance of individual items (questions) in a test or assessment instrument. In the classical approach to item analysis, two key statistics are commonly used to evaluate individual items: the P-value and the point-biserial correlation coefficient.

The P-value represents the proportion of examinees who respond in the keyed (correct) direction and is typically interpreted as a measure of item difficulty. A higher P-value indicates an easier item, while a lower P-value suggests a more difficult one.

The point-biserial correlation coefficient reflects the correlation between an individual item and the total test score (excluding the item itself). This statistic provides an index of the item’s ability to distinguish between high- and low-performing examinees, and is generally referred to as item discrimination.


Purpose of Item Analysis

  • Evaluate item quality: Identify strong and weak items
  • Improve test reliability: Remove or revise problematic items
  • Ensure validity: Confirm items measure the intended construct
  • Balance difficulty: Create tests with appropriate challenge levels
item_analysis <- function(responses){
    # Critical values: minimum acceptable point-biserial correlation and
    # lower/upper bounds on item difficulty (mean)
    cvpb <- 0.20
    cvdl <- 0.15
    cvdu <- 0.85
    
    require(CTT, warn.conflicts = FALSE, quietly = TRUE)
    ctt.analysis <- CTT::reliability(responses, itemal = TRUE, NA.Delete = TRUE)
    
    # Collect item-level statistics
    item.analysis <- data.frame(r.pbis = ctt.analysis$pBis,
                                bis = ctt.analysis$bis,
                                item.mean = ctt.analysis$itemMean,
                                alpha.del = ctt.analysis$alphaIfDeleted)
    
    # Mark items that are potentially problematic
    # (flagging rule provided by Dr. Gordon Brooks)
    item.analysis$check <- 
      ifelse(item.analysis$r.pbis < cvpb |
               item.analysis$item.mean < cvdl |
               item.analysis$item.mean > cvdu, "‡", "")
    
    return(item.analysis)
}

dump("item_analysis", file = "item_analysis.R")
knitr::kable(item_analysis(dich_data), 
             align = "c",
             caption = "Item Analysis")
Item Analysis
r.pbis bis item.mean alpha.del check
Item1 0.41791 0.51282 0.433 0.91492
Item2 0.41599 0.49164 0.294 0.91473
Item3 0.44152 0.55491 0.123 0.91420 ‡
Item4 0.55461 0.65858 0.379 0.91244
Item5 0.35033 0.42516 0.217 0.91554
Item6 0.58224 0.69065 0.370 0.91195
Item7 0.53247 0.61464 0.208 0.91284
Item8 0.59596 1.34172 0.105 0.91254 ‡
Item9 0.54787 0.62785 0.206 0.91261
Item10 0.40716 0.50491 0.146 0.91461 ‡
Item11 0.70123 0.80505 0.307 0.90992
Item12 0.70411 0.81855 0.328 0.90983
Item13 0.42275 0.51371 0.163 0.91441
Item14 0.55057 0.63898 0.322 0.91250
Item15 0.43675 0.51171 0.271 0.91435
Item16 0.48072 0.59304 0.120 0.91375 ‡
Item17 0.52201 0.60350 0.242 0.91297
Item18 0.47740 0.55562 0.292 0.91371
Item19 0.52336 0.66702 0.478 0.91303
Item20 0.60932 0.76361 0.455 0.91146
Item21 0.33831 0.49773 0.064 0.91543 ‡
Item22 0.47149 0.54963 0.265 0.91378
Item23 0.51590 0.63656 0.119 0.91334 ‡
Item24 0.07407 0.28518 0.005 0.91706 ‡
Item25 0.37787 0.46158 0.425 0.91562
Item26 0.50252 0.59722 0.379 0.91337
Item27 0.60378 0.71468 0.144 0.91209 ‡
Item28 0.51865 0.66961 0.100 0.91344 ‡
Item29 0.54861 0.67074 0.426 0.91256
Item30 0.55915 0.64676 0.194 0.91247

Column Definitions

| Column | Meaning |
|---|---|
| item | Item identifier |
| r.pbis | Point-biserial correlation (item-total correlation) |
| bis | Biserial correlation (approximate) |
| item.mean | Mean score on the item (difficulty indicator) |
| alpha.del | Cronbach's alpha if this item is deleted |
| check | Flag indicating a potential problem (‡ means flagged) |

Item Analysis Interpretation

  • Overall Reliability (0.91 – 0.92): excellent internal consistency for the test instrument.

  • Item Performance: Most items perform adequately; however, 9 flagged items (‡) require immediate attention due to poor psychometric properties.


Problematic Items Requiring Review

| Item | r.pbis | item.mean | alpha.del | Issue | Recommendation |
|---|---|---|---|---|---|
| 3 | 0.44 | 0.12 | 0.91420 | Very difficult (floor effect) | Review or revise wording |
| 8 | 0.60 | 0.11 | 0.91254 | Very difficult, despite high discrimination | Review difficulty level |
| 10 | 0.41 | 0.15 | 0.91461 | Low mean, modest discrimination | Check for ambiguity |
| 16 | 0.48 | 0.12 | 0.91375 | Very hard, borderline discrimination | Consider revising |
| 21 | 0.34 | 0.06 | 0.91543 | Extremely difficult, low discrimination | Remove or reword |
| 23 | 0.52 | 0.12 | 0.91334 | Very difficult, modest contribution | Review clarity/content |
| 24 | 0.07 | 0.005 | 0.91706 | 🚨 Extremely low discrimination and mean | Remove immediately |
| 27 | 0.60 | 0.14 | 0.91209 | High discrimination, very low mean | Retain with caution |
| 28 | 0.52 | 0.10 | 0.91344 | Good discrimination, extreme difficulty | Consider balance |

Key Metrics Summary

| Metric | Threshold / Ideal Range | Flagged / Problematic Items | Exemplary Items |
|---|---|---|---|
| Point-Biserial (r.pbis) | 0.20 - 0.50 | Item 24 (0.07); lowest otherwise: 21 (0.34), 5 (0.35) | Items 8 (0.60), 11 (0.70), 12 (0.70), 20 (0.61), 27 (0.60) |
| Biserial Correlation (bis) | > 0.30 | Item 24 (0.29); borderline: 5 (0.43), 25 (0.46) | Items 11 (0.81), 12 (0.82), 20 (0.76), 27 (0.71) |
| Item Difficulty (Mean) | 0.30 - 0.70 | Too hard: 3 (0.12), 8 (0.11), 10 (0.15), 16 (0.12), 21 (0.06), 23 (0.12), 24 (0.005), 27 (0.14), 28 (0.10) | Balanced: 6 (0.37), 14 (0.32), 19 (0.48), 20 (0.46), 26 (0.38), 29 (0.43) |
| Alpha if Item Deleted (α.del) | Current α ≈ 0.91 | Higher α if removed: 24 (0.917), 21 (0.915), 25 (0.916), 5 (0.916) | Low impact or reduces α: Items 6 (0.912), 11 (0.910), 12 (0.910), 20 (0.911) |

Action Plan

| Action Type | Items Involved | Rationale |
|---|---|---|
| Remove Immediately | 24 | Extremely low mean (0.005) and discrimination (r.pbis = 0.07); removal increases α |
| Strongly Consider Removal | 21, 5, 25, 10 | Very low means, weak discrimination, or α would improve if removed |
| Revise / Reword | 3, 8, 16, 23, 28 | Floor effects, borderline or inconsistent discrimination |
| Retain with Caution | 27 | Very low mean but strong discrimination (r.pbis = 0.60) |
| Retain as Is | 6, 11, 12, 14, 19, 20, 26, 29, 30 | Good psychometric properties; support scale reliability |
| Add New Items | | Introduce items with difficulty in the 0.4 - 0.6 range to balance the test |

Exemplary Items

| Item | r.pbis | Mean | α if Deleted | Notes |
|---|---|---|---|---|
| 8 | 0.60 | 0.11 | 0.91254 | Excellent discrimination; very hard item, retain with caution |
| 11 | 0.70 | 0.31 | 0.90992 | Strong item; boosts reliability |
| 12 | 0.70 | 0.33 | 0.90983 | Strong item; solid psychometrics |
| 19 | 0.52 | 0.48 | 0.91303 | Balanced difficulty and strong r.pbis |
| 20 | 0.61 | 0.46 | 0.91146 | Near-ideal item; great contribution |
| 26 | 0.50 | 0.38 | 0.91337 | Strong performance across metrics |
| 29 | 0.55 | 0.43 | 0.91256 | Excellent balance; retain |
| 30 | 0.56 | 0.19 | 0.91247 | Good discrimination; moderately difficult |

2.8.1 Item Difficulty

Item difficulty refers to the proportion of students who answer a particular item correctly. This statistic is essential for determining whether an item is appropriate for the ability level of the test-takers.

Items with difficulty values near 0 or 1 offer limited value in distinguishing among students, as they do not effectively discriminate between different ability levels. In contrast, items with difficulty values between 0.3 and 0.7 tend to provide the most information about differences in students’ abilities or trait levels.

However, if a test is specifically designed for an extreme group and a cut score has been established, the optimal item difficulty should align with that cut score (Allen & Yen, 1979).
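
Because the items are scored 0/1, each item's difficulty (p-value) is simply its column mean. A minimal check, assuming the `dich_data` matrix used throughout:

p <- colMeans(dich_data, na.rm = TRUE)  # proportion answering each item correctly
round(p, 3)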

#' Item Difficulty Analysis with Visualization
#'
#' Calculates and visualizes item difficulty statistics for binary test items,
#' returning a flagged table (ordered by item number) and diagnostic plot.
#'
#' @param dich_data A matrix or data.frame of binary responses (0 = incorrect, 1 = correct)
#' @param plot Logical. Should the diagnostic plot be displayed? (Default = TRUE)
#' @return A data.frame with columns: difficulty (p-value), point-biserial correlation, and check flags
#' @examples 
#' data <- data.frame(item1 = c(1,0,1,1), item2 = c(0,0,1,1))
#' item_difficulty(data)
#' @export
item_difficulty <- function(dich_data, plot = TRUE) {
  # Critical values
  cv <- list(
    # Flagging thresholds
    extreme_lower = 0.15,
    extreme_upper = 0.85,
    optimal_lower = 0.30,
    optimal_upper = 0.70,
    pb_cutoff = 0.20,
    
    # Plotting parameters
    point_cex = 2.5,
    extreme_lwd = 2,
    optimal_lwd = 1.5,
    label_cex = 0.85,
    axis_cex = 0.9
  )
  
  # Input validation
  if (!is.data.frame(dich_data) && !is.matrix(dich_data)) {
    stop("Input must be a data.frame or matrix")
  }
  if (any(!dich_data %in% c(0, 1, NA))) {
    warning("Data contains non-binary values - results may be invalid")
  }
  
  # Calculate statistics
  ctt <- CTT::reliability(dich_data, itemal = TRUE, NA.Delete = TRUE)
  
  # Create results table with item numbers
  results <- data.frame(
    item_number = 1:ctt$nItem,
    difficulty = ctt$itemMean,
    pb_correlation = ctt$pBis,
    check = ifelse(
      ctt$pBis < cv$pb_cutoff | 
      ctt$itemMean < cv$extreme_lower | 
      ctt$itemMean > cv$extreme_upper,
      "‡", ""
    ),
    stringsAsFactors = FALSE
  )
  rownames(results) <- paste("Item", 1:ctt$nItem)
  
  # Generate plot if requested
  if (plot) {
    # Save current graphical settings
    old_par <- par(no.readonly = TRUE)
    on.exit(par(old_par))
    
    # Set up color coding
    item_type <- ifelse(
      results$difficulty < cv$extreme_lower, "too_hard",
      ifelse(
        results$difficulty > cv$extreme_upper, "too_easy",
        ifelse(
          results$pb_correlation < cv$pb_cutoff, "low_pb", "good"
        )
      )
    )
    point_col <- c(
      "too_hard" = "#E76F51",
      "too_easy" = "#1F78B4", 
      "low_pb" = "#E9C46A",
      "good" = "#264653"
    )[item_type]
    
    # Create plot layout
    par(mar = c(5, 4, 4, 4) + 0.1, mgp = c(2.5, 0.8, 0))
    plot(NA, 
         xlim = c(0.5, ctt$nItem + 0.5), 
         ylim = c(0, 1),
         main = "Item Difficulty Analysis\n(with Point-Biserial Correlation)",
         xlab = "", ylab = "Difficulty (p-value)",
         xaxt = "n", yaxt = "n", bty = "o", cex.main = 1.2)
    
    # Add axes
    axis(1, at = 1:ctt$nItem, labels = rownames(results), cex.axis = cv$axis_cex)
    axis(2, at = seq(0, 1, 0.1), las = 1, cex.axis = cv$axis_cex)
    mtext("Item", side = 1, line = 3, cex = 0.9)
    
    # Shaded regions
    rect(par("usr")[1], cv$optimal_lower, par("usr")[2], cv$optimal_upper,
         col = adjustcolor("#264653", alpha.f = 0.15), border = NA)
    rect(par("usr")[1], 0, par("usr")[2], cv$extreme_lower,
         col = adjustcolor("#E76F51", alpha.f = 0.15), border = NA)
    rect(par("usr")[1], cv$extreme_upper, par("usr")[2], 1,
         col = adjustcolor("#1F78B4", alpha.f = 0.15), border = NA)
    
    # Reference lines
    abline(h = cv$extreme_lower, col = "#E76F51", lty = 2, lwd = cv$extreme_lwd)
    abline(h = cv$extreme_upper, col = "#1F78B4", lty = 2, lwd = cv$extreme_lwd)
    abline(h = cv$optimal_lower, col = "#264653", lty = 3, lwd = cv$optimal_lwd)
    abline(h = cv$optimal_upper, col = "#264653", lty = 3, lwd = cv$optimal_lwd)
    
    # Plot points
    points(1:ctt$nItem, results$difficulty,
           pch = 19, cex = cv$point_cex,
           col = point_col, lwd = 1.2)
    text(1:ctt$nItem, results$difficulty,
         labels = 1:ctt$nItem,
         col = "white", cex = cv$label_cex, font = 2)
    
    # Right axis for point-biserial
    par(new = TRUE)
    plot(results$pb_correlation, type = "h", lwd = 2.5,
         col = adjustcolor("gray50", alpha.f = 0.7),
         axes = FALSE, xlab = "", ylab = "", ylim = c(-0.5, 1))
    axis(4, at = seq(-0.5, 1, 0.25), las = 1, cex.axis = cv$axis_cex)
    mtext("Point-Biserial Correlation", side = 4, line = 2.5, cex = 0.9)
    
    # Legend
    legend("topright", 
           legend = c("Too Hard (p < 0.15)", "Too Easy (p > 0.85)", 
                     "Low Discrimination (r_pb < 0.20)", "Good Item"),
           col = c("#E76F51", "#1F78B4", "#E9C46A", "#264653"),
           pch = 19, pt.cex = cv$point_cex, cex = 0.85, bty = "n")
  }
  
  # Return results ordered by item number (original order)
  results <- results[order(results$item_number), ]
  results$check <- format(results$check, justify = "centre")
  return(results[, c("difficulty", "pb_correlation", "check")])  # Exclude item_number from output
}

# Save function to file if needed
dump("item_difficulty", file = "item_difficulty.R")
# Formatted table output
knitr::kable(
  item_difficulty(dich_data),
  align = c("l", "r", "r", "c"),  # Left, Right, Right, Center alignment
  caption = "Item-Total Correlation Analysis",
  digits = 3
)
## You will find additional options and better formatting using itemAnalysis().

Item-Total Correlation Analysis
difficulty pb_correlation check
Item 1 0.433 0.418
Item 2 0.294 0.416
Item 3 0.123 0.442 ‡
Item 4 0.379 0.555
Item 5 0.217 0.350
Item 6 0.370 0.582
Item 7 0.208 0.532
Item 8 0.105 0.596 ‡
Item 9 0.206 0.548
Item 10 0.146 0.407 ‡
Item 11 0.307 0.701
Item 12 0.328 0.704
Item 13 0.163 0.423
Item 14 0.322 0.551
Item 15 0.271 0.437
Item 16 0.120 0.481 ‡
Item 17 0.242 0.522
Item 18 0.292 0.477
Item 19 0.478 0.523
Item 20 0.455 0.609
Item 21 0.064 0.338 ‡
Item 22 0.265 0.471
Item 23 0.119 0.516 ‡
Item 24 0.005 0.074 ‡
Item 25 0.425 0.378
Item 26 0.379 0.503
Item 27 0.144 0.604 ‡
Item 28 0.100 0.519 ‡
Item 29 0.426 0.549
Item 30 0.194 0.559

Interpretation Guide for Item Difficulty Analysis

This function analyzes binary test items (0=incorrect, 1=correct) to evaluate:

  • Item difficulty (p-value)
  • Point-biserial correlation (discrimination index)
  • Flags potential problematic items

Diagnostic Plot Elements

  • Difficulty Values (Left Y-axis)
    • Plotted as colored circles (1-2.5x size)
    • X-axis shows item numbers
  • Point-Biserial Correlations (Right Y-axis)
    • Gray vertical lines show discrimination strength
    • Secondary y-axis on the right side
    • Help interpret difficulty points in relation to discrimination
  • Reference Zones
    • Red zone (p < 0.15): Too difficult
    • Blue zone (p > 0.85): Too easy
    • Teal zone (0.30-0.70): Optimal difficulty range
    • Yellow points: Low discrimination (r_pb < 0.20)

Interpretation Guidelines

Difficulty (p-value)

| Range | Interpretation | Action Consideration |
|---|---|---|
| p < 0.15 | Very difficult | Review for clarity/quality |
| 0.15 ≤ p < 0.30 | Moderately difficult | May need revision |
| 0.30 ≤ p ≤ 0.70 | Optimal difficulty | Good items |
| 0.70 < p ≤ 0.85 | Moderately easy | May be too simple |
| p > 0.85 | Very easy | Consider removing |

Point-Biserial Correlation

| Range | Interpretation | Action Consideration |
|---|---|---|
| r_pb < 0.20 | Poor discrimination | Strong candidate for revision/removal |
| 0.20 ≤ r_pb < 0.30 | Moderate discrimination | May need improvement |
| r_pb ≥ 0.30 | Good discrimination | Effective item |

Flagged Items

Items are flagged (‡) if they meet ANY of the following:

  1. p < 0.15 (too hard)
  2. p > 0.85 (too easy)
  3. r_pb < 0.20 (poor discrimination)


Decision Framework

  • For High-Stakes Tests:
    • Prioritize items in optimal difficulty range (0.30 - 0.70)
    • Remove or revise items with r_pb < 0.25
  • For Classroom Assessments:
    • Some very easy/hard items may be acceptable
    • Focus on removing poorly discriminating items
  • For Diagnostic Tests:
    • May intentionally include some extreme items
    • Ensure good discrimination throughout

Color Coding in Plot

  • Red: Too difficult (p < 0.15)
  • Blue: Too easy (p > 0.85)
  • Yellow: Low discrimination (r_pb < 0.20)
  • Dark Teal: Good items

The gray vertical lines are the point-biserial correlation bars drawn on the secondary y-axis on the right. They visually show how strongly each item’s response correlates with total test performance (discrimination). They help interpret the difficulty points (blue/red dots) in relation to discrimination without crowding the main plot.


Problematic Items (Too Hard or Low Discrimination)

| Item | Difficulty | r.pbis | Notes |
|---|---|---|---|
| Item24 | 0.005 | 0.074 | 🚨 Extremely hard, very poor discrimination — remove |
| Item21 | 0.064 | 0.338 | Extremely hard, borderline discrimination — revise |
| Item5 | 0.217 | 0.350 | Harder than optimal, relatively low discrimination — revise |

Very Hard but Reasonable Discrimination

| Item | Difficulty | r.pbis | Notes |
|---|---|---|---|
| Item28 | 0.100 | 0.519 | Very hard, but discriminates well — retain cautiously |
| Item8 | 0.105 | 0.596 | Very hard, strong discriminator — retain cautiously |
| Item23 | 0.119 | 0.516 | Difficult, but meaningful — review content clarity |
| Item16 | 0.120 | 0.481 | Difficult, decent discrimination — possible revision |
| Item3 | 0.123 | 0.442 | Very hard, borderline item — revise or clarify |
| Item27 | 0.144 | 0.604 | Difficult but excellent discrimination — retain |
| Item10 | 0.146 | 0.407 | Hard with modest discrimination — check clarity |
| Item13 | 0.163 | 0.423 | On the edge of acceptable — review wording |
| Item30 | 0.194 | 0.559 | Hard but solid performer — retain |

Well-Balanced Items

| Item | Difficulty | r.pbis | Notes |
|---|---|---|---|
| Item6 | 0.370 | 0.582 | Ideal range — retain |
| Item4 | 0.379 | 0.555 | Strong psychometric quality |
| Item26 | 0.379 | 0.503 | Balanced and reliable |
| Item14 | 0.322 | 0.551 | Ideal zone — keep |
| Item12 | 0.328 | 0.704 | Excellent performer — retain |
| Item19 | 0.478 | 0.523 | Great balance — retain |
| Item20 | 0.455 | 0.609 | Strong overall — retain |
| Item29 | 0.426 | 0.549 | Well-balanced |

Acceptable but Monitor

| Item | Difficulty | r.pbis | Notes |
|---|---|---|---|
| Item2 | 0.294 | 0.416 | Acceptable, slightly low difficulty |
| Item11 | 0.307 | 0.701 | Very strong discrimination |
| Item7 | 0.208 | 0.532 | Slightly hard, strong discrimination |
| Item9 | 0.206 | 0.548 | Good performance |
| Item15 | 0.271 | 0.437 | Borderline but solid |
| Item17 | 0.242 | 0.522 | Slightly low difficulty, strong discrimination |
| Item18 | 0.292 | 0.477 | Approaching ideal |
| Item22 | 0.265 | 0.471 | Moderate |
| Item1 | 0.433 | 0.418 | Balanced but less standout |
| Item25 | 0.425 | 0.378 | Acceptable, slightly lower discrimination |

Action Plan for Item Revision

Remove Immediately These items show extremely poor psychometric properties and should be removed from the test form.

| Item | Reason |
|---|---|
| Item24 | Extremely low mean (0.005) and r.pbis (0.07) — no meaningful measurement |
| Item5 | Below-optimal difficulty (p = 0.217) and comparatively weak discrimination |

2.8.2 Item Discrimination

The item discrimination index indicates how well a test item differentiates between individuals with high and low levels of the trait or ability being measured. In essence, it reflects the item's ability to distinguish between stronger and weaker examinees, serving as a measure of its discriminating power.

Items with low discrimination values may be poorly worded, ambiguous, or otherwise flawed, and should be reviewed carefully. Particular attention should be paid to items with negative discrimination indices, as these suggest that lower-performing students are more likely to answer the item correctly than higher-performing students. One possible explanation for a negative value is a scoring or typographical error—such as the correct response not being keyed—leading knowledgeable students to select an unkeyed but accurate option.

The item-discrimination index (Tristan, 1998), \(d_i\) of an item, \(i\) can be calculated as

\[ d_i = \frac{U_i}{n_{iU}} - \frac{L_i}{n_{iL}} \]

where

\(\qquad U_i\): the number of students who answered item \(i\) correctly and whose total test scores fall in the upper range (between the upper 10% and 33%, optimally the upper 27%).

\(\qquad L_i\): the number of students who answered item \(i\) correctly and whose total test scores fall in the lower range (between the lower 10% and 33%, optimally the lower 27%).

\(\qquad n_{iU}\): the number of students in the upper-scoring group.

\(\qquad n_{iL}\): the number of students in the lower-scoring group.
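
A minimal sketch of this index under the common 27% convention, assuming the `dich_data` matrix used throughout (the grouping fraction and arbitrary tie-breaking are simplifying assumptions):

upper_lower_d <- function(X, frac = 0.27) {
  X <- as.matrix(X)
  total <- rowSums(X)                            # total test scores
  g <- max(1, floor(frac * nrow(X)))             # group size per tail
  upper <- order(total, decreasing = TRUE)[1:g]  # highest scorers
  lower <- order(total)[1:g]                     # lowest scorers
  colMeans(X[upper, , drop = FALSE]) - colMeans(X[lower, , drop = FALSE])
}
round(upper_lower_d(dich_data), 3)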

#' Item Discrimination Analysis with Visualization
#'
#' Analyzes and visualizes item discrimination using point-biserial correlations
#'
#' @param dich_data A matrix or data.frame of binary responses (0=incorrect, 1=correct)
#' @param plot Logical. Should the diagnostic plot be displayed? (Default = TRUE)
#' @return A data.frame with point-biserial correlations, difficulty, and centered check flags
#' @examples 
#' data <- data.frame(item1 = c(1,0,1,1), item2 = c(0,0,1,1))
#' item_discrimination(data)
#' @export
item_discrimination <- function(dich_data, plot = TRUE) {
  # Input validation
  if (!is.data.frame(dich_data) && !is.matrix(dich_data)) {
    stop("Input must be a data.frame or matrix")
  }
  if (any(!dich_data %in% c(0, 1, NA))) {
    warning("Data contains non-binary values - results may be invalid")
  }

  # Thresholds
  cv <- list(
    min_acceptable = 0.20,
    good = 0.30, 
    excellent = 0.40,
    difficulty_low = 0.15,
    difficulty_high = 0.85
  )

  # Compute stats
  ctt <- CTT::reliability(dich_data, itemal = TRUE, NA.Delete = TRUE)

  # Create results with item numbers and centered check column
  results <- data.frame(
    item_number = 1:ctt$nItem,
    point_biserial = ctt$pBis,
    difficulty = ctt$itemMean,
    check = ifelse(
      ctt$pBis < cv$min_acceptable |
      ctt$itemMean < cv$difficulty_low |
      ctt$itemMean > cv$difficulty_high,
      "‡", ""),
    stringsAsFactors = FALSE
  )
  rownames(results) <- paste("Item", 1:ctt$nItem)

  # Generate plot if requested
  if (plot) {
    # Save graphical settings
    old_par <- par(no.readonly = TRUE)
    on.exit(par(old_par))
    
    # Color coding (for plot only)
    plot_colors <- ifelse(
      results$point_biserial < cv$min_acceptable, "red",
      ifelse(results$point_biserial < cv$excellent, "dodgerblue", "darkgreen")
    )
    
    # Plot setup
    y_min <- max(min(results$point_biserial, na.rm = TRUE) * 0.9, -0.2)
    y_max <- min(max(results$point_biserial, na.rm = TRUE) * 1.1, 1.1)
    
    par(mar = c(5, 4, 4, 2) + 0.1, mgp = c(2.5, 0.8, 0))
    plot(NA,
         xlim = c(0.5, ctt$nItem + 0.5),
         ylim = c(y_min, y_max),
         xlab = "",
         ylab = "Point-Biserial Correlation",
         main = "Item Discrimination Analysis",
         xaxt = "n", bty = "o")
    
    axis(1, at = 1:ctt$nItem, labels = rownames(results))
    
    # Draw circles with colors
    symbols(1:ctt$nItem,
            results$point_biserial,
            circles = rep(0.4, ctt$nItem),
            inches = FALSE,
            bg = plot_colors,
            cex = 25,
            fg = "black",
            add = TRUE)
    
    # Add centered item numbers
    text(1:ctt$nItem,
         results$point_biserial,
         labels = 1:ctt$nItem,
         col = "white", font = 2, cex = 0.85)
    
    # Reference lines
    abline(h = cv$min_acceptable, col = "red", lty = 2, lwd = 1.5)
    abline(h = cv$good, col = "dodgerblue", lty = 3, lwd = 1.5)
    abline(h = cv$excellent, col = "darkgreen", lty = 3, lwd = 1.5)
    
    # Legend
    legend("topright", 
           legend = c("Poor (r < 0.20)", "Good (0.20 ≤ r < 0.40)", "Excellent (r ≥ 0.40)"),
           col = c("red", "dodgerblue", "darkgreen"), 
           pch = 19, pt.cex = 1.5, cex = 0.85, bty = "n")
  }
  
  # Return results ordered by item number (original order)
  results <- results[order(results$item_number), ]
  results$check <- format(results$check, justify = "centre")
  return(results[, c("point_biserial", "difficulty", "check")])  # Exclude item_number from output
}

# Save the function if needed
dump("item_discrimination", file = "item_discrimination_function.R")
# Formatted table output
knitr::kable(
  item_discrimination(dich_data),
  align = c("l", "r", "r", "c"),  # Left, Right, Right, Center alignment
  caption = "Item-Total Correlation Analysis",
  digits = 3
)
## You will find additional options and better formatting using itemAnalysis().

Item-Total Correlation Analysis
point_biserial difficulty check
Item 1 0.418 0.433
Item 2 0.416 0.294
Item 3 0.442 0.123 ‡
Item 4 0.555 0.379
Item 5 0.350 0.217
Item 6 0.582 0.370
Item 7 0.532 0.208
Item 8 0.596 0.105 ‡
Item 9 0.548 0.206
Item 10 0.407 0.146 ‡
Item 11 0.701 0.307
Item 12 0.704 0.328
Item 13 0.423 0.163
Item 14 0.551 0.322
Item 15 0.437 0.271
Item 16 0.481 0.120 ‡
Item 17 0.522 0.242
Item 18 0.477 0.292
Item 19 0.523 0.478
Item 20 0.609 0.455
Item 21 0.338 0.064 ‡
Item 22 0.471 0.265
Item 23 0.516 0.119 ‡
Item 24 0.074 0.005 ‡
Item 25 0.378 0.425
Item 26 0.503 0.379
Item 27 0.604 0.144 ‡
Item 28 0.519 0.100 ‡
Item 29 0.549 0.426
Item 30 0.559 0.194

Notes for Table Interpretation:

  • Items flagged with ‡ may need review because they:
    • Have poor discrimination (point-biserial \(< 0.20\))
    • Are too difficult (\(p < 0.15\))
    • Are too easy (\(p > 0.85\))
  • Interpretation guidelines:
    • Excellent discrimination: \(r \geq 0.40\)
    • Good discrimination: \(0.20 \leq r < 0.40\)
    • Problematic discrimination: \(r < 0.20\)

Item Discrimination Analysis

📊 Discrimination Overview

  • Acceptable Range: 0.20 - 0.50 (point-biserial)
  • Poor Discrimination (< 0.20): 1 item (24)
  • Marginal (0.20 - 0.40): 3 items (21, 5, 25)
  • Good to Excellent (≥ 0.40): 26 items

Problematic Items

These items have low discrimination (r_pb < 0.30) and/or extreme difficulty (too easy > 0.85 or too hard < 0.15):

| Item | r_pb | Difficulty | Issue |
|---|---|---|---|
| Item24 | 0.074 | 0.005 | Extremely hard and non-discriminating — remove |
| Item21 | 0.338 | 0.064 | Extremely hard — review or reword |
| Item5 | 0.350 | 0.217 | Harder than optimal, low-ish discrimination |
| Item3 | 0.442 | 0.123 | Very hard, moderate discrimination — review |
| Item10 | 0.407 | 0.146 | Too difficult, borderline discrimination |

Difficult but Discriminating

These items are hard but offer strong psychometric value. Review for clarity but consider retaining:

| Item | r_pb | Difficulty | Notes |
|---|---|---|---|
| Item8 | 0.596 | 0.105 | Very difficult but strong discrimination |
| Item16 | 0.481 | 0.120 | Very difficult but still useful |
| Item23 | 0.516 | 0.119 | Low mean, but good r_pb — monitor |
| Item28 | 0.519 | 0.100 | Excellent r_pb despite being very hard |
| Item27 | 0.604 | 0.144 | Hard but excellent discrimination — retain |

Key Observations

  1. Difficulty–Discrimination Relationship
  • Items with extremely low p-values (e.g., Item24: 0.005, Item21: 0.064) often show weak or misleading discrimination. These may not effectively differentiate between examinees and should be reviewed or removed.
  • Conversely, some difficult items like Item8 and Item27 have high discrimination (r_pb > 0.59), indicating they are informative despite being hard; retain with caution or revise for clarity.
  • Items in the moderate difficulty range (0.30 - 0.70) generally display strong and stable discrimination, making them ideal for most testing contexts (e.g., Items 6, 14, 20, 29).
  2. Flagged Items
  • Items such as Item24 (r_pb = 0.07, mean = 0.005) and Item21 (r_pb = 0.34, mean = 0.064) are highly problematic; they show floor effects and weak discrimination.
  • Items with low means and modest r_pb (e.g., Item3, Item10, Item16) should be carefully reviewed for ambiguity, bias, or lack of clarity.
  3. Exemplary Items
  • Items like Item11, Item12, Item20, Item27, and Item29 combine strong discrimination with acceptable difficulty levels, suggesting well-aligned content and wording.
  4. Redundancy & Scale Efficiency
  • Alpha-if-deleted statistics show that removing certain items (e.g., Item24, Item21) would increase overall reliability, which supports their removal or revision.
  5. Actionable Direction
  • Remove or revise Items 24, 21, and 5.
  • Retain and potentially emphasize Items 6, 14, 20, 27, and 29 in future versions.
  • Balance the test by introducing more moderately difficult items to smooth out the scale.

2.8.3 Item-Total Correlation

The item-total correlation (also known as the point-biserial correlation between an item and the total test score) is a measure of how well an individual item discriminates between high and low performers on the overall test. It is calculated as the correlation between the binary scores (0 or 1) on a specific item and the total test scores (sum of all items).

\[ r_{iX} =\frac{\bar{X_{i}}-\bar{X}}{S_X} \times \sqrt{\frac{p_i}{1-p_i}} \]

where

\(\qquad \bar{X}_i\) : The mean of the test scores for students who answered item \(i\) correctly.

\(\qquad p_i\) : The item difficulty.

\(\qquad \bar{X}\): The mean of \(X\).

\(\qquad S_X\): The standard deviation of \(X\).

The more strongly performance on an item correlates with the total test score, the higher the \(r_{iX}\) and \(d_i\) measures. As a rule of thumb, an item with a discrimination index smaller than 0.2 is not acceptable (Allen & Yen, 1979). Both \(r_{iX}\) and \(d_i\) should be positive on any reasonable item, meaning more high-scoring students than low-scoring students answer the item correctly. When these measures are negative, the item is measuring the opposite of what the rest of the test is measuring. In that case, either an error was made in scoring that particular item or it is poorly worded. For example, the scale for negatively worded items should be reversed. Items with negative or low \(r_{iX}\) or \(d_i\) values should be revised or eliminated.

When performance on an item is uncorrelated with performance on all the other items in the test, the item's point-biserial correlation will still be positive if that item's score is included in the total test score. Therefore, computing the point-biserial correlation against a total score that excludes the item itself provides a more accurate discrimination measure. Items with a negative item discrimination index are counterproductive to the purpose of testing.
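
A minimal sketch of this corrected (item-excluded) item-total correlation, again assuming the `dich_data` matrix used throughout:

corrected_item_total <- function(X) {
  X <- as.matrix(X)
  total <- rowSums(X)
  # correlate each item with the total score computed without that item
  sapply(seq_len(ncol(X)), function(i) cor(X[, i], total - X[, i]))
}
round(corrected_item_total(dich_data), 3)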

#' Item–Total Correlation Analysis with Visualization
#'
#' @param dich_data matrix/data.frame of 0/1/NA
#' @param method "bis" (biserial) or "pBis" (point-biserial)
#' @param plot logical, draw figure (default TRUE)
#' @param flag_style one of c("halo","fill","rug","guide","shape","none")
#' @param show_difficulty logical, overlay item difficulty on right axis
#' @return data.frame with correlation, difficulty, and check flag
#' @export
test_item_total <- function(
  dich_data,
  method = "bis",
  plot = TRUE,
  flag_style = c("halo","fill","rug","guide","shape","none"),
  show_difficulty = TRUE
) {
  ## ---- Validation ----
  if (!is.data.frame(dich_data) && !is.matrix(dich_data)) {
    stop("Input must be a data.frame or matrix")
  }
  A <- as.matrix(dich_data)
  if (any(!is.na(A) & !A %in% c(0,1))) {
    warning("Data contains non-binary values - results may be invalid")
  }
  method <- match.arg(method, c("bis","pBis"))
  flag_style <- match.arg(flag_style)

  ## ---- Thresholds ----
  cv <- list(
    min_acceptable = 0.20,
    good           = 0.30,
    excellent      = 0.40,
    difficulty_low = 0.15,
    difficulty_high= 0.85
  )

  ## ---- Stats ----
  ctt  <- CTT::reliability(dich_data, itemal = TRUE, NA.Delete = TRUE)
  corr <- if (method == "bis") ctt$bis else ctt$pBis

  results <- data.frame(
    item_number = seq_len(ctt$nItem),
    correlation = corr,
    difficulty  = ctt$itemMean,
    check = ifelse(corr < cv$min_acceptable |
                     ctt$itemMean < cv$difficulty_low |
                     ctt$itemMean > cv$difficulty_high, "‡", ""),
    stringsAsFactors = FALSE
  )
  rownames(results) <- if (!is.null(colnames(dich_data))) colnames(dich_data) else paste("Item", results$item_number)

  if (!plot) return(results[, c("correlation","difficulty","check")])

  ## ===========================
  ## Publication-ready base-R plot
  ## ===========================
  old_par <- par(no.readonly = TRUE); on.exit(par(old_par), add = TRUE)

  # Labels & positions
  item_pos    <- results$item_number
  item_labels <- rownames(results)

  # Palette (named base-R colors)
  COL_LOW  <- "darkred"; COL_MID <- "darkblue"; COL_HIGH <- "darkgreen"
  COL_DIFF <- "gray40";  COL_GRID <- "gray90"

  point_cols <- ifelse(results$correlation < cv$min_acceptable, COL_LOW,
                       ifelse(results$correlation < cv$excellent, COL_MID, COL_HIGH))

  # Limits (auto headroom)
  y_min <- 0
  y_max <- max(results$correlation, cv$excellent, na.rm = TRUE)
  y_max <- ceiling((y_max + 0.05) * 10) / 10

  # Compact margins
  par(
    oma = c(3.2, 1.0, 3.0, if (show_difficulty) 3.2 else 1.0),   # outer margins
    mar = c(2.0, 4.8, 1.2, if (show_difficulty) 4.2 else 1.2),   # inner margins
    mgp = c(2.0, 0.6, 0)                                        # axis spacing
  )

  # Base canvas
  plot(NA, xlim = c(0.5, ctt$nItem + 0.5), ylim = c(y_min, y_max),
       xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n")

  # Grid
  abline(h = seq(0, y_max, by = 0.1), col = COL_GRID, lwd = 1)

  # Reference lines
  abline(h = cv$min_acceptable, col = COL_LOW,  lty = 2, lwd = 1.6)
  abline(h = cv$excellent,      col = COL_HIGH, lty = 3, lwd = 1.6)

  # Base points + labels (LEFT axis active)
  points(item_pos, results$correlation, pch = 21, bg = "white", col = point_cols, lwd = 2, cex = 2.3)
  text(item_pos, results$correlation, labels = item_pos, col = point_cols, font = 2, cex = 0.75)

  # Flags (draw BEFORE par(new=TRUE), so they use left-axis coords)
  idx_flag <- which(results$check == "‡")
  if (length(idx_flag)) {
    if (flag_style == "halo") {
      points(item_pos[idx_flag], results$correlation[idx_flag],
             pch = 21, bg = NA, col = "yellow", lwd = 3.5, cex = 2.8)
    } else if (flag_style == "fill") {
      bg_all <- rep("white", length(item_pos))
      bg_all[idx_flag] <- adjustcolor(COL_LOW, 0.45)
      points(item_pos, results$correlation, pch = 21, bg = bg_all, col = point_cols, lwd = 2, cex = 2)
    } else if (flag_style == "rug") {
      segments(x0 = item_pos[idx_flag], x1 = item_pos[idx_flag],
               y0 = par("usr")[3] + 0.01 * diff(par("usr")[3:4]),
               y1 = par("usr")[3] + 0.05 * diff(par("usr")[3:4]),
               col = COL_LOW, lwd = 2)
    } else if (flag_style == "guide") {
      abline(v = item_pos[idx_flag], col = adjustcolor(COL_LOW, 0.25), lwd = 2, lty = 1)
    } else if (flag_style == "shape") {
      points(item_pos[idx_flag], results$correlation[idx_flag],
             pch = 23, bg = adjustcolor(COL_LOW, 0.35), col = COL_LOW, cex = 2, lwd = 1.6)
    }
  }

  # Axes (left + bottom)
  axis(2, las = 1)
  mtext(paste(ifelse(method == "bis","Biserial","Point-Biserial"), "Correlation"), side = 2, line = 2.8)
  axis(1, at = item_pos, labels = item_labels, las = 2, cex.axis = 0.8)

  # Optional difficulty overlay on RIGHT axis
  if (show_difficulty) {
    par(new = TRUE)
    plot(item_pos, results$difficulty, type = "l", lty = 3, lwd = 2, col = COL_DIFF,
         axes = FALSE, xlab = "", ylab = "", xlim = c(0.5, ctt$nItem + 0.5), ylim = c(0, 1))
    points(item_pos, results$difficulty, pch = 16, cex = 0.9, col = COL_DIFF)
    axis(4, las = 1)
    mtext("Item Difficulty (proportion correct)", side = 4, line = 2.6)
  }

  # Legend
  leg_labels <- c(sprintf("Low (< %.2f)", cv$min_acceptable),
                  sprintf("Good (%.2f–%.2f)", cv$min_acceptable, cv$excellent - .01),
                  sprintf("Excellent (≥ %.2f)", cv$excellent))
  leg_cols <- c(COL_LOW, COL_MID, COL_HIGH)
  leg_pch  <- c(21, 21, 21)
  leg_lty  <- c(NA, NA, NA)
  leg_lwd  <- c(NA, NA, NA)

  if (show_difficulty) {
    leg_labels <- c(leg_labels, "Difficulty")
    leg_cols   <- c(leg_cols, COL_DIFF)
    leg_pch    <- c(leg_pch, 16)
    leg_lty    <- c(leg_lty, 2)
    leg_lwd    <- c(leg_lwd, 2)
  }
  if (length(idx_flag) && flag_style != "none") {
    # Legend keys use the same colors actually drawn for each flag style
    flag_key <- switch(flag_style,
      halo  = list(lbl = "Flagged (halo)",   col = "yellow", pch = 21, lty = NA, lwd = 3),
      fill  = list(lbl = "Flagged (filled)", col = COL_LOW,  pch = 21, lty = NA, lwd = 2),
      rug   = list(lbl = "Flagged (rug)",    col = COL_LOW,  pch = NA, lty = 1,  lwd = 2),
      guide = list(lbl = "Flagged (guide)",  col = COL_LOW,  pch = NA, lty = 1,  lwd = 2),
      shape = list(lbl = "Flagged (diamond)",col = COL_LOW,  pch = 23, lty = NA, lwd = 2)
    )
    leg_labels <- c(leg_labels, flag_key$lbl)
    leg_cols   <- c(leg_cols,   flag_key$col)
    leg_pch    <- c(leg_pch,    flag_key$pch)
    leg_lty    <- c(leg_lty,    flag_key$lty)
    leg_lwd    <- c(leg_lwd,    flag_key$lwd)
  }

  legend("topright", inset = 0.02, bty = "n", pt.cex = 2,
         legend = leg_labels, col = leg_cols, pch = leg_pch,
         lty = leg_lty, lwd = leg_lwd, ncol = 1, cex = 1)

  # Titles (outer margin) + compact footnote
  mtext(sprintf("Item–Total %s Correlations", ifelse(method=="bis","Biserial","Point-Biserial")),
        side = 3, outer = TRUE, line = 1.2, cex = 1.1, font = 2)
  mtext(sprintf("nItems = %d  |  thresholds: low < %.2f; excellent ≥ %.2f",
                ctt$nItem, cv$min_acceptable, cv$excellent),
        side = 3, outer = TRUE, line = 0.11, cex = 0.9)
  mtext("Item Number", side = 1, outer = TRUE, line = 1.6)

  if (length(idx_flag)) {
    mtext("‡ Flagged items: correlation < 0.20 or difficulty outside [.15, .85]",
          side = 3, line = .1, cex = 0.75)
  }

  invisible(results[, c("correlation","difficulty","check")])
}

# Save the function
dump("test_item_total", file = "item_total_correlation_function.R")
# Formatted table output
knitr::kable(
  test_item_total(dich_data),
  align = c("l", "r", "r", "c"),  # Left, Right, Right, Center alignment
  caption = "Item-Total Correlation Analysis",
  digits = 3
)

Item-Total Correlation Analysis

| Item   | correlation | difficulty | check |
|:-------|------------:|-----------:|:-----:|
| Item1  | 0.513 | 0.433 |   |
| Item2  | 0.492 | 0.294 |   |
| Item3  | 0.555 | 0.123 | ‡ |
| Item4  | 0.659 | 0.379 |   |
| Item5  | 0.425 | 0.217 |   |
| Item6  | 0.691 | 0.370 |   |
| Item7  | 0.615 | 0.208 |   |
| Item8  | 1.342 | 0.105 | ‡ |
| Item9  | 0.628 | 0.206 |   |
| Item10 | 0.505 | 0.146 | ‡ |
| Item11 | 0.805 | 0.307 |   |
| Item12 | 0.819 | 0.328 |   |
| Item13 | 0.514 | 0.163 |   |
| Item14 | 0.639 | 0.322 |   |
| Item15 | 0.512 | 0.271 |   |
| Item16 | 0.593 | 0.120 | ‡ |
| Item17 | 0.603 | 0.242 |   |
| Item18 | 0.556 | 0.292 |   |
| Item19 | 0.667 | 0.478 |   |
| Item20 | 0.764 | 0.455 |   |
| Item21 | 0.498 | 0.064 | ‡ |
| Item22 | 0.550 | 0.265 |   |
| Item23 | 0.637 | 0.119 | ‡ |
| Item24 | 0.285 | 0.005 | ‡ |
| Item25 | 0.462 | 0.425 |   |
| Item26 | 0.597 | 0.379 |   |
| Item27 | 0.715 | 0.144 | ‡ |
| Item28 | 0.670 | 0.100 | ‡ |
| Item29 | 0.671 | 0.426 |   |
| Item30 | 0.647 | 0.194 |   |

Notes for Table Interpretation:

Items are flagged (‡) if they meet any of these criteria:

  • Discrimination < 0.20 (poor at distinguishing skill levels)
  • Difficulty < 0.15 (too hard)
  • Difficulty > 0.85 (too easy)

The dotted overlay line in the plot traces each item's difficulty, so difficulty can be read alongside the correlation values in a single display for a richer item analysis. Item difficulty is the proportion of examinees who answered the item correctly (the item's "p-value"), and the secondary y-axis on the right gives its scale (0 to 1). Viewing item discrimination (correlation) and item difficulty together makes it easier to judge whether items are too hard, too easy, or poorly discriminating.
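
As a transparency check, both statistics in the plot can be reproduced directly from the scored data. The sketch below assumes dich_data holds the 0/1-scored responses used throughout; note it computes the corrected point-biserial, which can differ somewhat from the biserial values reported above.

X <- as.matrix(dich_data)
difficulty <- colMeans(X)                  # proportion correct per item
total <- rowSums(X)
pbis <- sapply(seq_len(ncol(X)), function(j)
  cor(X[, j], total - X[, j]))             # corrected point-biserial (item removed from total)
head(round(cbind(difficulty, pbis), 3))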

Biserial Correlation Analysis Report

Overall Discrimination Strength Thresholds:

  • < 0.30: Poor discrimination
  • 0.30 - 0.39: Adequate
  • 0.40 - 0.49: Good
  • ≥ 0.50: Excellent
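
These bands can be applied mechanically, as in this brief sketch (the input values are illustrative, taken from the table below):

bis <- c(0.29, 0.43, 0.46, 0.49, 0.50)   # illustrative biserial values
cut(bis, breaks = c(-Inf, 0.30, 0.40, 0.50, Inf),
    labels = c("Poor", "Adequate", "Good", "Excellent"), right = FALSE)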

Interpretation of Biserial Correlation and Difficulty

| Item   | Biserial | Difficulty | Interpretation |
|:-------|---------:|-----------:|:---------------|
| Item24 | 0.29 | 0.005 | 🚨 Very low biserial and near-zero proportion correct → Not discriminating; remove immediately. |
| Item5  | 0.43 | 0.217 | Quite hard, modest discrimination → May benefit from easing or clearer wording. |
| Item25 | 0.46 | 0.425 | Within ideal range → Solid item worth retaining. |
| Item2  | 0.49 | 0.294 | Just below optimal difficulty, strong biserial → Reliable and informative item. |
| Item21 | 0.50 | 0.064 | Very difficult but moderate biserial → Too hard for most; monitor or revise. |
| Item10 | 0.50 | 0.146 | Low proportion correct but decent biserial → Review clarity or complexity. |
| Item15 | 0.51 | 0.271 | Good biserial, slightly difficult → Functioning well. |
| Item1  | 0.51 | 0.433 | Solid item with near-ideal difficulty → Retain. |
| Item13 | 0.51 | 0.163 | Moderate biserial, quite hard → Consider revision for accessibility. |
| Item22 | 0.55 | 0.265 | Strong biserial, slightly hard but acceptable → High-performing item. |

Actionable Recommendations

  • Remove or Replace
    • Item24: Extremely low difficulty (0.005) and poor discrimination → offers no meaningful differentiation.
    • Item21: Very difficult (0.064), low instructional value unless targeting high-performers only.
  • Revise for Clarity or Difficulty
    • Item5: Hard (0.217), borderline discrimination → consider easing the item or adjusting distractors.
    • Item10: Low mean (0.146), modest biserial → review for ambiguity or excessive complexity.
    • Item13: Difficult (0.163), moderate biserial → may need rewording to improve accessibility.
    • Item2: Near-threshold difficulty (0.294) — could be strengthened for balance.
  • Retain Strong Items
    • Items 1, 15, 22, 25: Well-functioning based on biserial and difficulty.
    • Item2: Reliable, though bordering the low end of ideal difficulty.
  • Balance the Test
    • Aim for a range of item difficulties (0.3 – 0.7) to ensure score spread across ability levels.
    • Replace extremely easy or hard items with moderately challenging alternatives.
  • Ongoing Monitoring
    • Track performance of revised items in future administrations.
    • Consider conducting cognitive interviews or distractor analyses for revised items.
# Remove items 24, 21, 5 (based on your analysis)
items_to_remove <- c(24, 21, 5)
clean_data <- dich_data[, -items_to_remove]


# 1. FIRST CONVERT TO DATA FRAME (if matrix/array)
if(!is.data.frame(clean_data)) {
  clean_data <- as.data.frame(clean_data)
}

# 2. NOW RUN THE ANALYSIS PROPERLY
library(psych)
library(dplyr)

alpha_results <- clean_data %>% 
  # Convert all columns to numeric (safe conversion)
  mutate(across(everything(), ~as.numeric(as.character(.)))) %>% 
  # Calculate reliability
  psych::alpha(check.keys = TRUE)

# 3. VIEW RESULTS
print(alpha_results$total)  # Cronbach's alpha
##  raw_alpha std.alpha G6(smc) average_r    S/N       ase    mean      sd
##     0.9161   0.91837 0.92018   0.29413 11.251 0.0037458 0.27015 0.24014
##  median_r
##   0.28929
#View(alpha_results$item.stats)  # Item-level stats
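
As a cross-check on psych::alpha(), raw alpha can also be computed directly from its defining formula, assuming clean_data is the numeric data frame prepared above:

# alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
k <- ncol(clean_data)
alpha_raw <- (k / (k - 1)) *
  (1 - sum(apply(clean_data, 2, var)) / var(rowSums(clean_data)))
alpha_raw  # should closely match raw_alpha above (~0.916)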

2.9 Distractor/Option Analysis

Distractor analysis evaluates the effectiveness of incorrect answer choices (distractors) in multiple-choice items by examining how different ability groups select each option. Here’s a structured breakdown:

Distractor Analysis Process

  • Group Examinees by Ability
    • Action: Divide test-takers into lower, middle, and upper thirds based on total test scores.
    • Rationale: Ensures comparisons across distinct performance levels.
  • Calculate Selection Percentages
    • Action: For each option (correct answer and distractors), compute the percentage of examinees in each ability group who selected it.

Example:

  • Correct Option (A):
    • 80% of upper group
    • 30% of lower group
  • Distractor (B):
    • 10% of upper group
    • 50% of lower group

Evaluate Distractor Effectiveness

  • Effective Distractors:
    • Attract more lower-ability examinees than higher-ability
    • Have similar selection rates among distractors (e.g., 20%, 25%, 30% in lower group)
  • Ineffective Distractors:
    • Negligible selection (< 5% in all groups) → Too obvious or irrelevant
    • Attract higher-ability examinees more than lower-ability → Likely misleading or ambiguous
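
The full analysis function below works from overall selection percentages. As a complement, here is a minimal sketch of the ability-group comparison just described, assuming dich_data (0/1 scores) and responses_letter (raw A–E responses) from earlier sections:

total_score <- rowSums(dich_data)
ability <- cut(rank(total_score, ties.method = "first"), breaks = 3,
               labels = c("Lower", "Middle", "Upper"))   # thirds by total score
# Row percentages: how each ability group distributes across the options of Item 1
round(prop.table(table(ability, responses_letter[[1]]), margin = 1) * 100, 1)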
#' Perform Comprehensive Distractor Analysis
#'
#' @param response_data Data.frame of student responses
#' @param key Answer key (vector)
#' @param problematic_threshold Threshold for flagging overly-selected distractors (default: 30)
#' @param correct_threshold Threshold for flagging low correct percentage (default: 40)
#' @param low_threshold Minimum percentage for functional distractors (default: 5)
#' @param easy_threshold Threshold for suspiciously high correct percentage (default: 90)
#' @param print_report Whether to print the problematic items report (default: TRUE)
#' @return List containing analysis results with named problematic items

analyze_distractors <- function(
    response_data, 
    key, 
    problematic_threshold = 30,
    correct_threshold = 40,
    low_threshold = 5,
    easy_threshold = 90,
    print_report = TRUE
) {
  
  # Convert inputs to consistent format
  response_data <- data.frame(
    lapply(response_data, as.character), 
    stringsAsFactors = FALSE
  )
  key <- as.character(key)
  
  # Initialize results storage
  item_stats <- list()
  problematic_items <- list()
  
  # Analyze each item
  for (i in seq_len(ncol(response_data))) {
    item_name <- colnames(response_data)[i]
    item_responses <- response_data[[i]]
    item_key <- key[i]
    
    # Calculate response frequencies
    freq_table <- table(factor(item_responses, levels = unique(c(item_key, LETTERS[1:5]))))
    pct_table <- round(prop.table(freq_table) * 100, 1)
    
    # Store item statistics
    item_stats[[item_name]] <- pct_table
    
    # Get correct answer percentage
    correct_pct <- ifelse(item_key %in% names(pct_table), pct_table[item_key], 0)
    
    # Identify problematic distractors (too high)
    distractor_pcts <- pct_table[names(pct_table) != item_key]
    problematic_distractors <- distractor_pcts[distractor_pcts > problematic_threshold]
    
    # Identify non-functional distractors (too low)
    non_functional_distractors <- distractor_pcts[distractor_pcts < low_threshold]
    
    # Flag item if any of these conditions are met:
    # 1. Correct answer % too low
    # 2. Correct answer % suspiciously high
    # 3. Has problematic distractors (too high)
    # 4. Has non-functional distractors (too low)
    if (correct_pct < correct_threshold || 
        correct_pct > easy_threshold ||
        length(problematic_distractors) > 0 || 
        length(non_functional_distractors) > 0) {
      
      problematic_items[[item_name]] <- list(
        item_name = item_name,
        correct_pct = correct_pct,
        problematic_distractors = problematic_distractors,
        non_functional_distractors = non_functional_distractors,
        is_too_hard = correct_pct < correct_threshold,
        is_too_easy = correct_pct > easy_threshold,
        problem_type = if (correct_pct < correct_threshold && length(problematic_distractors) > 0) {
          "hard_and_bad_distractors"
        } else if (correct_pct < correct_threshold && length(non_functional_distractors) > 0) {
          "hard_and_non_functional"
        } else if (correct_pct > easy_threshold) {
          "too_easy"
        } else if (length(problematic_distractors) > 0) {
          "bad_distractors"
        } else if (correct_pct < correct_threshold) {
          "too_hard"  # too difficult without other distractor issues
        } else {
          "non_functional_distractors"
        }
      )
    }
  }
  
  # Create results object
  results <- list(
    item_stats = item_stats,
    problematic_items = problematic_items,
    thresholds = list(
      problematic = problematic_threshold,
      correct = correct_threshold,
      low = low_threshold,
      easy = easy_threshold
    )
  )
  
  # Print report if requested
  if (print_report) {
    print_problematic_items(results)
  }
  
  return(results)
}

#' Internal function to print problematic items report
#' @param analysis_results Analysis results from analyze_distractors
#' @keywords internal
print_problematic_items <- function(analysis_results) {
  if (length(analysis_results$problematic_items) > 0) {
    cat("\n======= PROBLEMATIC ITEMS DETAILED REPORT =======\n")
    cat("-------------------------------------------------\n")
    
    for (item in names(analysis_results$problematic_items)) {
      p <- analysis_results$problematic_items[[item]]
      
      cat(sprintf("\nITEM: %s\n", p$item_name))
      cat(sprintf("Correct Answer: %.1f%% %s\n", 
                  p$correct_pct,
                  ifelse(p$is_too_hard, "(TOO LOW)", 
                         ifelse(p$is_too_easy, "(TOO HIGH)", "(OK)"))))
      
      if (length(p$problematic_distractors) > 0) {
        cat("Overly Selected Distractors (>", analysis_results$thresholds$problematic, "%):\n")
        for (dist in names(p$problematic_distractors)) {
          cat(sprintf(" - %s: %.1f%%\n", dist, p$problematic_distractors[dist]))
        }
      }
      
      if (length(p$non_functional_distractors) > 0) {
        cat("Non-Functional Distractors (<", analysis_results$thresholds$low, "%):\n")
        for (dist in names(p$non_functional_distractors)) {
          cat(sprintf(" - %s: %.1f%%\n", dist, p$non_functional_distractors[dist]))
        }
      }
      
      # Diagnosis and recommendations
      cat("Diagnosis: ")
      switch(p$problem_type,
             "hard_and_bad_distractors" = cat("Too difficult AND has bad distractors\n"),
             "hard_and_non_functional" = cat("Too difficult AND has non-functional distractors\n"),
             "too_easy" = cat("Suspiciously easy (possible answer leakage)\n"),
             "too_hard" = cat("Too difficult overall\n"),
             "bad_distractors" = cat("Contains overly attractive distractors\n"),
             "non_functional_distractors" = cat("Has non-functional distractors\n"))
      
      cat("Suggested Actions:\n")
      switch(p$problem_type,
             "hard_and_bad_distractors" = cat(" - Complete revision needed\n - Check wording and distractors\n"),
             "hard_and_non_functional" = cat(" - Make item easier\n - Improve or replace weak distractors\n"),
             "too_easy" = cat(" - Verify answer key\n - Increase difficulty\n"),
             "bad_distractors" = cat(" - Revise problematic options\n"),
             "non_functional_distractors" = cat(" - Replace rarely-chosen distractors\n"))
      cat("--------------------------------------------------\n")
    }
    cat("\n=================== END REPORT ===================\n")
  } else {
    cat("\nNo problematic items found - test appears well constructed!\n")
  }
}
# 1. Create consistent answer key where ALL correct answers are "E"
answer_key_list <- rep("E", ncol(responses_letter))  # All items have "E" as correct answer
names(answer_key_list) <- colnames(responses_letter)  # Optional: name the vector

# 2. Run analysis
set.seed(42)
distractor_analysis <- analyze_distractors(responses_letter, answer_key_list)
## 
## ======= PROBLEMATIC ITEMS DETAILED REPORT =======
## -------------------------------------------------
## 
## ITEM: Item1
## Correct Answer: 43.3% (OK)
## Non-Functional Distractors (< 5 %):
##  - B: 3.3%
## Diagnosis: Has non-functional distractors
## Suggested Actions:
##  - Replace rarely-chosen distractors
## --------------------------------------------------
## 
## ITEM: Item2
## Correct Answer: 29.4% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - B: 30.1%
## Non-Functional Distractors (< 5 %):
##  - C: 3.5%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item3
## Correct Answer: 12.3% (TOO LOW)
## Diagnosis: Too difficult overall
## Suggested Actions:
##  - Review content alignment and item difficulty
## --------------------------------------------------
## 
## ITEM: Item4
## Correct Answer: 37.9% (TOO LOW)
## Diagnosis: Too difficult overall
## Suggested Actions:
##  - Review content alignment and item difficulty
## --------------------------------------------------
## 
## ITEM: Item5
## Correct Answer: 21.7% (TOO LOW)
## Diagnosis: Too difficult overall
## Suggested Actions:
##  - Review content alignment and item difficulty
## --------------------------------------------------
## 
## ITEM: Item6
## Correct Answer: 37.0% (TOO LOW)
## Non-Functional Distractors (< 5 %):
##  - D: 4.2%
## Diagnosis: Too difficult AND has non-functional distractors
## Suggested Actions:
##  - Make item easier
##  - Improve or replace weak distractors
## --------------------------------------------------
## 
## ITEM: Item7
## Correct Answer: 20.8% (TOO LOW)
## Diagnosis: Too difficult overall
## Suggested Actions:
##  - Review content alignment and item difficulty
## --------------------------------------------------
## 
## ITEM: Item8
## Correct Answer: 10.5% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - B: 40.1%
## Non-Functional Distractors (< 5 %):
##  - C: 3.3%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item9
## Correct Answer: 20.6% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - C: 32.2%
## Non-Functional Distractors (< 5 %):
##  - D: 2.0%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item10
## Correct Answer: 14.6% (TOO LOW)
## Diagnosis: Too difficult overall
## Suggested Actions:
##  - Review content alignment and item difficulty
## --------------------------------------------------
## 
## ITEM: Item11
## Correct Answer: 30.7% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - D: 31.0%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item12
## Correct Answer: 32.8% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 34.5%
## Non-Functional Distractors (< 5 %):
##  - B: 4.3%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item13
## Correct Answer: 16.3% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 34.5%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item14
## Correct Answer: 32.2% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - C: 32.3%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item15
## Correct Answer: 27.1% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - D: 34.2%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item16
## Correct Answer: 12.0% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - B: 59.5%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item17
## Correct Answer: 24.2% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 33.4%
## Non-Functional Distractors (< 5 %):
##  - C: 0.0%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item18
## Correct Answer: 29.2% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 47.0%
## Non-Functional Distractors (< 5 %):
##  - D: 1.9%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item20
## Correct Answer: 45.5% (OK)
## Overly Selected Distractors (> 30 %):
##  - C: 31.0%
## Non-Functional Distractors (< 5 %):
##  - D: 2.7%
## Diagnosis: Contains overly attractive distractors
## Suggested Actions:
##  - Revise problematic options
## --------------------------------------------------
## 
## ITEM: Item21
## Correct Answer: 6.4% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - C: 35.9%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item22
## Correct Answer: 26.5% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 38.0%
## Non-Functional Distractors (< 5 %):
##  - D: 3.8%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item23
## Correct Answer: 11.9% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - D: 59.6%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item24
## Correct Answer: 0.5% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 33.8%
##  - C: 40.3%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item26
## Correct Answer: 37.9% (TOO LOW)
## Non-Functional Distractors (< 5 %):
##  - D: 0.5%
## Diagnosis: Too difficult AND has non-functional distractors
## Suggested Actions:
##  - Make item easier
##  - Improve or replace weak distractors
## --------------------------------------------------
## 
## ITEM: Item27
## Correct Answer: 14.4% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - D: 57.8%
## Non-Functional Distractors (< 5 %):
##  - C: 1.6%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item28
## Correct Answer: 10.0% (TOO LOW)
## Overly Selected Distractors (> 30 %):
##  - A: 38.4%
##  - D: 44.9%
## Non-Functional Distractors (< 5 %):
##  - B: 0.5%
## Diagnosis: Too difficult AND has bad distractors
## Suggested Actions:
##  - Complete revision needed
##  - Check wording and distractors
## --------------------------------------------------
## 
## ITEM: Item29
## Correct Answer: 42.6% (OK)
## Non-Functional Distractors (< 5 %):
##  - C: 1.3%
##  - D: 4.8%
## Diagnosis: Has non-functional distractors
## Suggested Actions:
##  - Replace rarely-chosen distractors
## --------------------------------------------------
## 
## ITEM: Item30
## Correct Answer: 19.4% (TOO LOW)
## Diagnosis: Too difficult overall
## Suggested Actions:
##  - Review content alignment and item difficulty
## --------------------------------------------------
## 
## =================== END REPORT ===================
#' Visualize Distractor Analysis Results with Comprehensive Markers
#'
#' @param analysis_results Output from analyze_distractors()
#' @param key Answer key (vector)
#' @param items_to_plot Character vector of item names to visualize
#' @param problematic_threshold Threshold for flagging distractors (default: 30)
#' @param correct_threshold Threshold for flagging low correct % (default: 40)
#' @param low_threshold Minimum % for functional distractors (default: 5)
#' @param easy_threshold Threshold for suspiciously easy items (default: 90)
#' @param hard_threshold Threshold below which the correct percentage marks a hard item (default: 20)
#' @param max_facet_cols Maximum number of columns in facet wrap (default: 3)
#' @return A ggplot object with enhanced warning markers
plot_distractor_analysis <- function(analysis_results, 
                                   key,
                                   items_to_plot = NULL,
                                   problematic_threshold = 30,
                                   correct_threshold = 40,
                                   low_threshold = 5,
                                   easy_threshold = 90,
                                   hard_threshold = 20,
                                   max_facet_cols = 3) {
  
  # Load required packages
  if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  library(ggplot2)
  library(dplyr)
  
  # Validate inputs
  if (is.null(items_to_plot)) {
    items_to_plot <- names(analysis_results$item_stats)
  }
  
  # Prepare plot data with clear problem classification
  plot_data <- do.call(rbind, lapply(items_to_plot, function(item) {
    stats <- analysis_results$item_stats[[item]]
    item_key <- key[which(names(analysis_results$item_stats) == item)]
    
    # Initialize problem info
    problem_info <- list(
      problematic = character(0),
      non_functional = character(0),
      is_too_hard = FALSE,
      is_too_easy = FALSE
    )
    
    # Check if item is problematic
    if (!is.null(analysis_results$problematic_items) && 
        item %in% names(analysis_results$problematic_items)) {
      p_item <- analysis_results$problematic_items[[item]]
      
      if (!is.null(p_item$problematic_distractors)) {
        problem_info$problematic <- names(p_item$problematic_distractors)
      }
      if (!is.null(p_item$non_functional_distractors)) {
        problem_info$non_functional <- names(p_item$non_functional_distractors)
      }
      if (!is.null(p_item$is_too_hard)) {
        problem_info$is_too_hard <- p_item$is_too_hard
      }
      if (!is.null(p_item$is_too_easy)) {
        problem_info$is_too_easy <- p_item$is_too_easy
      }
    }
    
    # Create data frame
    df <- data.frame(
      Item = factor(item, levels = items_to_plot),
      Option = factor(names(stats), levels = unique(names(stats))),
      Percentage = as.numeric(stats),
      IsCorrect = names(stats) == item_key,
      stringsAsFactors = FALSE
    )
    
    # Classify problem types with clearer logic
    df$ProblemType <- "normal"
    
    # Mark correct answers first
    df$ProblemType[df$IsCorrect] <- "correct"
    
    # Then override for special cases
    correct_percentage <- df$Percentage[df$IsCorrect]
    
    if (length(correct_percentage) > 0) {
      if (correct_percentage >= easy_threshold) {
        df$ProblemType[df$IsCorrect] <- "too_easy_correct"
      } else if (correct_percentage <= hard_threshold) {
        df$ProblemType[df$IsCorrect] <- "too_hard_correct"
      }
    }
    
    # Mark problematic distractors
    df$ProblemType[df$Option %in% problem_info$problematic] <- "bad_distractor"
    df$ProblemType[df$Option %in% problem_info$non_functional] <- "non_functional"
    
    return(df)
  }))
  
  # ---- ENHANCED VISUAL SCHEME ----
  fill_colors <- c(
    "FALSE" = "#FF6B6B",  # Coral for incorrect
    "TRUE" = "#4ECDC4"    # Turquoise for correct
  )
  
  # More distinct threshold line colors
  threshold_colors <- c(
    problematic = "#FF9F1C",  # Orange
    low = "#FFE66D",         # Yellow
    correct = "#2EC4B6",     # Teal
    easy = "#457B9D",        # Blue (for easy)
    hard = "#E63946"         # Red (for hard)
  )
  
  # Marker specifications - triangles for distractors, text for difficulty
  marker_specs <- list(
    bad_distractor = list(color = "#FF9F1C", fill = "yellow", symbol = "!", shape = 24, size = 3),
    non_functional = list(color = "#A8DADC", fill = "yellow", symbol = "!", shape = 25, size = 3),
    too_easy_correct = list(color = "#457B9D", symbol = "EASY", size = 4),
    too_hard_correct = list(color = "#E63946", symbol = "HARD", size = 4)
  )
  
  # Create base plot
  p <- ggplot(plot_data, aes(x = Option, y = Percentage, fill = IsCorrect)) +
    geom_bar(stat = "identity", width = 0.7, alpha = 0.9, color = "white", linewidth = 0.3) +
    
    # Enhanced threshold lines
    geom_hline(yintercept = problematic_threshold, linetype = "dotted", 
               color = threshold_colors["problematic"], linewidth = 1) +
    geom_hline(yintercept = low_threshold, linetype = "dotted", 
               color = threshold_colors["low"], linewidth = 1) +
    geom_hline(yintercept = correct_threshold, linetype = "dotted", 
               color = threshold_colors["correct"], linewidth = 1) +
    geom_hline(yintercept = easy_threshold, linetype = "dotted", 
               color = threshold_colors["easy"], linewidth = 1) +
    geom_hline(yintercept = hard_threshold, linetype = "dotted", 
               color = threshold_colors["hard"], linewidth = 1) +
    
    scale_fill_manual(
      name = "Answer Type:",
      labels = c("Incorrect", "Correct"),
      values = fill_colors
    ) +
    facet_wrap(~Item, scales = "free_x", ncol = min(max_facet_cols, length(items_to_plot))) +
    labs(
      title = "Distractor Analysis with Problem Indicators",
      x = "Response Option",
      y = "Percentage of Students"
    ) +
    theme_minimal(base_size = 12) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
      legend.position = "bottom",
      plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
      strip.text = element_text(face = "bold", size = 11),
      panel.grid.major.x = element_blank(),
      panel.spacing = unit(1.5, "lines")
    )
  
  # Add all problem markers
  problematic_data <- subset(plot_data, ProblemType %in% names(marker_specs))
  if (nrow(problematic_data) > 0) {
    # Add triangular markers for distractors
    distractor_data <- subset(problematic_data, ProblemType %in% c("bad_distractor", "non_functional"))
    if (nrow(distractor_data) > 0) {
      p <- p + 
        geom_point(
          data = distractor_data,
          aes(shape = ProblemType),
          color = sapply(distractor_data$ProblemType, function(x) marker_specs[[x]]$color),
          fill = sapply(distractor_data$ProblemType, function(x) marker_specs[[x]]$fill),
          size = 5, stroke = 1.5, position = position_nudge(y = 5)
        ) +
        scale_shape_manual(
          name = "Distractor Problems:",
          values = c("bad_distractor" = 24, "non_functional" = 25),
          labels = c(
            "Overly selected distractor",
            "Non-functional distractor"
          )
        ) +
        geom_text(
          data = distractor_data,
          aes(label = sapply(ProblemType, function(x) marker_specs[[x]]$symbol)),
          position = position_nudge(y = 5),
          size = 4, vjust = 0.5, fontface = "bold",
          color = sapply(distractor_data$ProblemType, function(x) marker_specs[[x]]$color)
        )
    }
    
    # Add text markers for difficulty
    difficulty_data <- subset(problematic_data, ProblemType %in% c("too_easy_correct", "too_hard_correct"))
    if (nrow(difficulty_data) > 0) {
      p <- p +
        geom_text(
          data = difficulty_data,
          aes(label = sapply(ProblemType, function(x) marker_specs[[x]]$symbol)),
          position = position_nudge(y = 5),
          size = 4, vjust = 0.5, fontface = "bold",
          color = sapply(difficulty_data$ProblemType, function(x) marker_specs[[x]]$color)
        )
    }
  }
  
  return(p)
}
# Visualize results
plot_distractor_analysis(
  analysis_results = distractor_analysis,
  key = answer_key_list,
  items_to_plot = names(distractor_analysis$item_stats)
)


Distractor Analysis Plot Interpretation Guide

  • Bar Colors:
    • Turquoise Bars: Correct answers (#4ECDC4)
    • Coral Bars: Incorrect options/distractors (#FF6B6B)

Threshold Reference Lines

  • Orange Dotted Line (30% default):
    • Problematic Distractor Threshold: Any distractor crossing this line is potentially too attractive
    • Action: Review distractors marked with orange "!" triangles
  • Yellow Dotted Line (5% default):
    • Non-functional Threshold: Distractors below this line aren't being selected enough
    • Action: Consider removing or revising options marked with teal "!" triangles
  • Blue Dotted Line (90% default):
    • Too Easy Threshold: Correct answers above this line may be too simple
    • Action: Items marked "EASY" in blue may need increased difficulty
  • Red Dotted Line (20% default):
    • Too Hard Threshold: Correct answers below this line indicate difficult items
    • Action: Items marked "HARD" in red may need revision
  • Teal Dotted Line (40% default):
    • Minimum Acceptable Threshold: Desired baseline for correct answers
    • Action: Items consistently below warrant investigation

Warning Markers

| Marker | Color  | Meaning                    | Threshold Reference |
|:------:|:-------|:---------------------------|:--------------------|
| ▲      | Orange | Over-selected distractor   | > 30% selection     |
| ▼      | Teal   | Non-functional distractor  | < 5% selection      |
| EASY   | Blue   | Suspiciously easy item     | > 90% correct       |
| HARD   | Red    | Problematic hard item      | < 20% correct       |

Interpretation Framework

For Individual Items:

  • Well-Functioning Item:
    • Clear turquoise peak (correct answer)
    • All distractors below orange line
    • Correct percentage between 40-90%
    • No warning markers present
  • Problem Indicators:
    • Any coral bar crossing orange line (with ▲)
    • Teal ▼ on low distractors (<5%)
    • "EASY"/"HARD" text markers
    • Multiple high bars suggesting ambiguity

For Test-Wide Quality:

  • Good Test Indicators:
    • Majority of items show one dominant turquoise bar
    • Few distractors exceed 30% selection
    • Most correct answers in 40-90% range
    • Balanced distribution of item difficulties
  • 🚨 Problem Areas:
    • Clustering of "HARD" labels (many difficult items)
    • Numerous "EASY" labels (potential ceiling effect)
    • Frequent ▲ markers (distractors needing revision)
    • Many ▼ markers (non-functional options)

Actionable Recommendations

  • For Over-Selected Distractors (▲):
    • Analyze why the distractor is attractive
    • Check for accidental correctness or ambiguity
    • Consider rewriting or replacing
  • For Non-Functional Distractors (▼):
    • Verify if the option is implausible
    • Determine if the distractor is too obvious
    • May need complete replacement
  • For "HARD" Items:
    • Check alignment with taught material
    • Review for unclear wording
    • Consider scaffolding or partial credit
  • For "EASY" Items:
    • Verify they test meaningful knowledge
    • Check for excessive clues in stem
    • May need increased complexity

Note on Threshold Customization: All thresholds can be adjusted via function parameters to match your assessment goals. The defaults reflect common psychometric standards but should be calibrated to your specific context.
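
For example, a stricter re-run might look like this (the threshold values here are purely illustrative):

strict_analysis <- analyze_distractors(
  responses_letter, answer_key_list,
  problematic_threshold = 25,   # flag distractors drawing more than 25%
  low_threshold         = 3,    # tolerate distractors chosen by as few as 3%
  correct_threshold     = 35,   # accept somewhat harder items
  print_report          = FALSE
)
length(strict_analysis$problematic_items)  # number of items now flagged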


2.10 Comparison to Item Response Theory

CTT offers several advantages over Item Response Theory (IRT), primarily due to its simplicity and minimal theoretical assumptions, making it broadly applicable across diverse testing contexts (Fan, 1998). CTT requires little more than observed test scores, which makes it accessible to researchers and practitioners without specialized statistical training. At the test level, CTT provides widely recognized reliability coefficients such as Cronbach’s alpha, test–retest reliability, and split-half reliability. At the item level, it incorporates essential statistics such as item difficulty (the proportion of examinees who answer an item correctly) and item discrimination (the degree to which an item differentiates between high- and low-scoring individuals). Unlike IRT, which models the probabilistic relationship between latent ability and item responses, CTT provides a straightforward descriptive framework for evaluating test quality.

This simplicity, however, comes with important limitations. CTT statistics are sample-dependent: both person measures (e.g., observed scores) and item parameters (e.g., difficulty and discrimination) can vary considerably across different populations. For example, an item that appears easy in one sample may not function similarly in another with different background characteristics. This sample dependence complicates applications such as test equating, longitudinal measurement, and computerized adaptive testing, where item and person invariance are crucial (Fan, 1998). Furthermore, because CTT assumes that measurement error is uniformly distributed across the score scale, it cannot easily model differences in precision at different ability levels—something IRT handles explicitly.
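
A small simulation makes the sample-dependence point concrete. Assuming a simple logistic (Rasch-type) response model, the same item yields very different CTT p-values in samples drawn from different ability distributions (all names and values below are illustrative):

set.seed(123)
b <- 0                                  # item difficulty on the latent scale
theta_low  <- rnorm(1000, mean = -1)    # lower-ability sample
theta_high <- rnorm(1000, mean =  1)    # higher-ability sample
p_resp <- function(theta) 1 / (1 + exp(-(theta - b)))
mean(rbinom(1000, 1, p_resp(theta_low)))    # CTT difficulty in sample 1
mean(rbinom(1000, 1, p_resp(theta_high)))   # CTT difficulty in sample 2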

Despite these constraints, several CTT-based solutions have been developed to mitigate its shortcomings. Equipercentile equating, for instance, allows practitioners to empirically align test scores from different forms by matching percentile distributions. Thurstone’s absolute scaling (Engelhard, 1990, as cited in Fan, 1998) extends the CTT framework by introducing procedures for item-invariant measurement, showing that even within CTT it is possible to achieve more generalizable parameter estimates under certain conditions. In addition, advances in generalizability theory, which extends CTT concepts to multiple sources of measurement error, provide a more nuanced understanding of reliability across testing occasions, raters, and item facets.
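
The core equipercentile idea can be sketched in a few lines: map a Form X score to the Form Y score holding the same percentile rank. This is only a bare illustration with simulated totals; operational equating adds presmoothing and careful interpolation (see, e.g., the equate package):

set.seed(1)
form_x <- rbinom(500, size = 30, prob = 0.55)   # hypothetical Form X totals
form_y <- rbinom(500, size = 30, prob = 0.45)   # hypothetical, harder Form Y
equate_x_to_y <- function(x) {
  pr <- ecdf(form_x)(x)                         # percentile rank on Form X
  as.numeric(quantile(form_y, probs = pr))      # Form Y score at that rank
}
equate_x_to_y(20)   # Form Y equivalent of a Form X score of 20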


References

Abramson, L. Y., Seligman, M. E. P., & Teasdale, J. D. (1978). Learned helplessness in humans: Critique and reformulation. Journal of Abnormal Psychology, 87(1), 49–74. https://doi.org/10.1037/0021-843X.87.1.49
Ainslie, G. (1975). Specious reward: A behavioral theory of impulsiveness and impulse control. Psychological Bulletin, 82(4), 463–496. https://doi.org/10.1037/h0076860
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Brooks/Cole.
Allen, M. J., & Yen, W. M. (2002a). Introduction to measurement theory (2nd ed.). Waveland Press.
Allen, M. J., & Yen, W. M. (2002b). Introduction to measurement theory. Waveland Press.
Ames, C. (1992). Classrooms: Goals, structures, and student motivation. Journal of Educational Psychology, 84(3), 261–271. https://doi.org/10.1037/0022-0663.84.3.261
Ashcraft, M. H. (2002). Math anxiety: Personal, educational, and cognitive consequences. Current Directions in Psychological Science, 11(5), 181–185. https://doi.org/10.1111/1467-8721.00196
Ashcraft, M. H., & Kirk, E. P. (2001). The relationships among working memory, math anxiety, and performance. Journal of Experimental Psychology: General, 130(2), 224–237. https://doi.org/10.1037/0096-3445.130.2.224
Ashcraft, M. H., & Krause, J. A. (2007). Working memory, math performance, and math anxiety. Psychonomic Bulletin & Review, 14(2), 243–248. https://doi.org/10.3758/BF03194059
Ashcraft, M. H., & Moore, A. M. (2009). Mathematics anxiety and the affective drop in performance. Journal of Psychoeducational Assessment, 27(3), 197–205. https://doi.org/10.1177/0734282908330580
Ashcraft, M. H., & Ridley, K. S. (2005). Math anxiety and its cognitive consequences: A tutorial review. In J. I. D. Campbell (Ed.), Handbook of mathematical cognition (pp. 315–327). Psychology Press.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Prentice-Hall.
Bandura, A. (1997). Self-efficacy: The exercise of control. W.H. Freeman.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246. https://doi.org/10.1037/0033-2909.107.2.238
Blascovich, J., & Mendes, W. B. (2000). Challenge and threat appraisals: The role of affective cues. Advances in Experimental Social Psychology, 32, 361–422.
Blascovich, J., Mendes, W. B., Hunter, S. B., Lickel, B., & Kowai-Bell, J. (2001). Perceiver threat in social interactions with stigmatized others. Journal of Personality and Social Psychology, 80(2), 253–267. https://doi.org/10.1037/0022-3514.80.2.253
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Buckner, R. L., Andrews-Hanna, J. R., & Schacter, D. L. (2008). The brain’s default network. Annals of the New York Academy of Sciences, 1124, 1–38. https://doi.org/10.1196/annals.1440.011
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Sage Publications.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Holt, Rinehart; Winston.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
DeVellis, R. F. (2017). Scale development: Theory and applications (4th ed.). Sage Publications.
Dweck, C. S. (2008). Mindset: The new psychology of success. Random House.
Eccles, J. S. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75–146). W. H. Freeman.
Eccles, J. S., & Wigfield, A. (1998). Motivational beliefs, values, and goals. In Annual review of psychology (Vol. 53, pp. 109–132). https://doi.org/10.1146/annurev.psych.53.100901.135153
Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53, 109–132. https://doi.org/10.1146/annurev.psych.53.100901.135153
Elliot, A. J., & Church, M. A. (2006). Motivational influences on academic performance. Journal of Educational Psychology, 98(3), 542–555. https://doi.org/10.1037/0022-0663.98.3.542
Engelhard, G., Jr. (1990). Historical views of the concept of invariance in measurement theory.
Eysenck, M. W. (2007). Fundamentals of cognition. Psychology Press.
Eysenck, M. W., Derakshan, N., Santos, R., & Calvo, M. G. (2012). Anxiety and cognitive performance: Attentional control theory. Psychology Press.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357–381.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). Macmillan.
Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. Appleton-Century-Crofts.
Festinger, L. (1954). A theory of social comparison processes. Human Relations, 7(2), 117–140. https://doi.org/10.1177/001872675400700202
Flake, J. K., Barron, K. E., Hulleman, C., McCoach, D. B., & Welsh, M. E. (2015). Measuring cost: The forgotten component of expectancy-value theory. Contemporary Educational Psychology, 41, 232–244. https://doi.org/10.1016/j.cedpsych.2015.03.002
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18(1), 39–50. https://doi.org/10.1177/002224378101800104
Glorfeld, L. W. (1995). An improvement on Horn’s parallel analysis methodology for selecting the correct number of factors to retain. Educational and Psychological Measurement, 55(3), 377–393.
Goetz, T., Bieg, M., Lüdtke, O., Pekrun, R., & Hall, N. C. (2013). Do girls really experience more anxiety in mathematics? Psychological Science, 24(10), 2079–2087. https://doi.org/10.1177/0956797613486989
Gross, J. J. (2002). Emotion regulation: Affective, cognitive, and social consequences. Psychophysiology, 39(3), 281–291. https://doi.org/10.1017/S0048577201393198
Gross, J. J. (2015). Handbook of emotion regulation (2nd ed.). Guilford Press.
Grupe, D. W., & Nitschke, J. B. (2013). Uncertainty and anticipation in anxiety: An integrated neurobiological and psychological perspective. Nature Reviews Neuroscience, 14(7), 488–501. https://doi.org/10.1038/nrn3524
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Praeger.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (2006). Multivariate data analysis (6th ed.). Pearson.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2019). Multivariate data analysis (8th ed.). Cengage.
Henseler, J., Ringle, C. M., & Sarstedt, M. (2015). A new criterion for assessing discriminant validity in variance-based structural equation modeling. Journal of the Academy of Marketing Science, 43(1), 115–135. https://doi.org/10.1007/s11747-014-0403-8
Hopko, D. R., Mahadevan, R., Bare, R. L., & Hunt, M. K. (2003). The abbreviated math anxiety scale (AMAS): Construction, validity, and reliability. Assessment, 10(2), 178–182. https://doi.org/10.1177/1073191103010002008
Horn, J. L. (1965). A rationale and a test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55. https://doi.org/10.1080/10705519909540118
Huguet, P., Dumas, F., Marsh, H., Régner, I., Wheeler, L., Suls, J., Seaton, M., & Nezlek, J. (2009). Clarifying the role of social comparison in the big-fish-little-pond effect (BFLPE): An integrative study. Journal of Personality and Social Psychology, 97(1), 156–170. https://doi.org/10.1037/a0015558
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151. https://doi.org/10.1177/001316446002000116
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 31–36. https://doi.org/10.1007/BF02291575
Karabenick, S. A. (2003). Seeking help in large college classes: A person-centered approach. Contemporary Educational Psychology, 28(1), 37–58. https://doi.org/10.1016/S0361-476X(02)00012-7
Kline, P. (1999). The handbook of psychological testing (2nd ed.). Routledge.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160.
Lazarus, R. S. (1991). Emotion and adaptation. Oxford University Press.
LeDoux, J. (1996). The emotional brain: The mysterious underpinnings of emotional life. Simon & Schuster.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Lupien, S. J., Maheu, F., Tu, M., Fiocco, A., & Schramek, T. E. (2007). The effects of stress and stress hormones on human cognition: Implications for the field of brain and cognition. Brain and Cognition, 65(3), 209–237. https://doi.org/10.1016/j.bandc.2007.02.007
Lyons, I. M., & Beilock, S. L. (2012). When math hurts: Math anxiety predicts pain network activation in anticipation of doing math. PLOS ONE, 7(10), e48076. https://doi.org/10.1371/journal.pone.0048076
Maloney, E. A., & Beilock, S. L. (2012). Math anxiety: Who has it, why it develops, and how to guard against it. Trends in Cognitive Sciences, 16(8), 404–406. https://doi.org/10.1016/j.tics.2012.06.008
Markus, H. R., & Nurius, P. (1986). Possible selves. American Psychologist, 41(9), 954–969. https://doi.org/10.1037/0003-066X.41.9.954
Marsh, H. W. (1990). Causal ordering of academic self-concept and academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82(4), 646–656. https://doi.org/10.1037/0022-0663.82.4.646
Mattarella-Micke, A., Mateo, J., Kozak, M. N., Foster, K., & Beilock, S. L. (2011). Choke or thrive? The relation between salivary cortisol and math performance depends on individual differences in working memory and math-anxiety. Emotion, 11(4), 1000–1005. https://doi.org/10.1037/a0023224
Murray, H. A. (1938). Explorations in personality. Oxford University Press.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3(1), 1–18. https://doi.org/10.1016/0022-2496(66)90002-0
Núñez-Peña, M. I., & Suárez-Pellicioni, M. (2016a). Math anxiety: A review of its cognitive consequences, psychophysiological correlates, and brain bases. Cognitive, Affective, & Behavioral Neuroscience, 16(1), 3–22. https://doi.org/10.3758/s13415-015-0370-7
Núñez-Peña, M. I., & Suárez-Pellicioni, M. (2016b). Math anxiety: A review of its cognitive consequences, psychophysiological correlates, and brain bases. Cognitive, Affective, & Behavioral Neuroscience, 16(1), 3–22. https://doi.org/10.3758/s13415-015-0370-7
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
Owens, M., Stevenson, J., Hadwin, J. A., & Norgate, R. (2012). Anxiety and depression in academic performance: An exploration of the mediating factors of worry and working memory. School Psychology International, 33(4), 433–449. https://doi.org/10.1177/0143034311427433
Oyserman, D. (2007). Social identity and self-regulation. Social Psychology: Handbook of Basic Principles, 432–453.
Pajares, F., & Miller, M. D. (1994). Role of self-efficacy and self-concept beliefs in mathematical problem solving: A path analysis. Journal of Educational Psychology, 86(2), 193–203. https://doi.org/10.1037/0022-0663.86.2.193
Pekrun, R. (2006). The control-value theory of achievement emotions: Assumptions, corollaries, and implications for educational research and practice. Educational Psychology Review, 18(4), 315–341. https://doi.org/10.1007/s10648-006-9029-9
Pekrun, R., & Linnenbrink-Garcia, E. J. (2012). Academic emotions and student engagement. In S. L. Christenson, A. L. Reschly, & C. Wylie (Eds.), Handbook of research on student engagement (pp. 259–282). Springer.
Pekrun, R., & Linnenbrink-Garcia, L. (2017). Emotions at school and academic achievement. Educational Psychologist, 52, 73–86.
Peterson, R. A. (2000). A meta-analysis of variance accounted for and factor loadings in exploratory factor analysis. Marketing Letters, 11(3), 261–275. https://doi.org/10.1023/A:1008191211004
Phelps, E. A., & LeDoux, J. E. (2005). Contributions of the amygdala to emotion processing: From animal models to human behavior. Neuron, 48(2), 175–187. https://doi.org/10.1016/j.neuron.2005.09.025
Porges, S. W. (2007). The polyvagal perspective. Biological Psychology, 74(2), 116–143. https://doi.org/10.1016/j.biopsycho.2006.06.009
Raichle, M. E., MacLeod, A. M., Snyder, A. Z., Powers, W. J., Gusnard, D. A., & Shulman, G. L. (2001). A default mode of brain function. Proceedings of the National Academy of Sciences, 98(2), 676–682. https://doi.org/10.1073/pnas.98.2.676
Ramirez, G., Gunderson, E. A., Levine, S. C., & Beilock, S. L. (2013). Math anxiety, working memory, and math achievement in early elementary school. Journal of Cognition and Development, 14, 187–202.
Richardson, F. C., & Suinn, R. M. (1972). The mathematics anxiety rating scale: Psychometric data. Journal of Counseling Psychology, 19(6), 551–554. https://doi.org/10.1037/h0033456
Salkovskis, P. M. (1991). The importance of behaviour in the maintenance of anxiety and panic: A cognitive account. Behavioural Psychotherapy, 19(1), 6–19. https://doi.org/10.1017/S0141347300011472
Sapolsky, R. M. (2004). Why zebras don’t get ulcers (3rd ed.). Henry Holt and Company.
Sarason, I. G. (1984). Stress, anxiety, and cognitive interference: Reactions to tests. Journal of Personality and Social Psychology, 46(4), 929–938. https://doi.org/10.1037/0022-3514.46.4.929
Schunk, D. H. (1987). Peer models and children’s behavioral change. Review of Educational Research, 57(2), 149–174. https://doi.org/10.3102/00346543057002149
Shavelson, R. J., Hubner, J. J., & Stanton, G. C. (1976). Self-concept: Validation of construct interpretations. Review of Educational Research, 46(3), 407–441. https://doi.org/10.3102/00346543046003407
Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. Appleton-Century.
Skinner, B. F. (1953). Science and human behavior. Macmillan.
Solomon, L. J., & Rothblum, E. D. (1984). Academic procrastination: Frequency and cognitive-behavioral correlates. Journal of Counseling Psychology, 31(4), 503–509. https://doi.org/10.1037/0022-0167.31.4.503
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25(2), 173–180. https://doi.org/10.1207/s15327906mbr2502_4
Stodolsky, S. S. (1985). Telling math: Origins of math aversion and anxiety. Educational Psychologist, 20(3), 125–133. https://doi.org/10.1207/s15326985ep2003_2
Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. G. Austin & S. Worchel (Eds.), The social psychology of intergroup relations (pp. 33–47). Brooks/Cole.
Thayer, J. F., Åhs, F., Fredrikson, M., Sollers, J. J., III, & Wager, T. D. (2012). A meta-analysis of heart rate variability and neuroimaging studies: Implications for heart rate variability as a marker of stress and health. Neuroscience & Biobehavioral Reviews, 36(2), 747–756. https://doi.org/10.1016/j.neubiorev.2011.11.009
Tobias, S. (1986). Anxiety and mathematics. Teachers College Press.
Traub, R. E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16(4), 8–14. https://doi.org/10.1111/j.1745-3992.1997.tb00603.x
Tristan, L. A. (1998). The item discrimination index: Does it work? Rasch Measurement Transactions, 12(1), 626.
Turner, J. H. (2002). Face-to-face: Towards a sociological theory of interpersonal behavior. Sociological Perspectives, 45(3), 193–216. https://doi.org/10.1525/sop.2002.45.3.193
Weiner, B. (1985). An attributional theory of achievement motivation and emotion. Springer-Verlag.
White, R. W. (1959). Motivation reconsidered: The concept of competence. Psychological Review, 66(5), 297–333. https://doi.org/10.1037/h0040934
Wigfield, A., Eccles, J. S., Fredricks, J. A., Simpkins, S., Roeser, R. W., & Schiefele, U. (2016). Development of achievement motivation and engagement. Handbook of Child Psychology and Developmental Science, 3, 1–44. https://doi.org/10.1002/9781118963418.childpsy316
Wigfield, A., & Meece, J. L. (1988). Math anxiety in elementary and secondary school students. Journal of Educational Psychology, 80(2), 210–216. https://doi.org/10.1037/0022-0663.80.2.210
Zimbardo, P. G., & Boyd, J. N. (1999). Putting time in perspective: A valid, reliable individual-differences metric. Journal of Personality and Social Psychology, 77(6), 1271–1288. https://doi.org/10.1037/0022-3514.77.6.1271