Psychometrics & Factor Analysis

Topic #2: Developing Psychometric Scales

Wayne State University

1 September, 2025

First, A Bit of Review

A general model of measurement (a small simulation of this model follows the definitions below):

\[ Y = \lambda \eta + \epsilon, \qquad \mathrm{Var}(\eta) = \psi, \qquad \mathrm{Var}(\epsilon) = \theta \]

Where…

  • \(\psi\) = variance associated with the latent variable (i.e., \(\eta\))
  • \(\eta\) = latent variable
  • \(\lambda\) = factor loading associating observed variable (i.e., \(Y\)) to latent variable (i.e., \(\eta\))
  • \(\epsilon\) = error associated with observed variables (i.e., \(Y\))
  • \(\theta\) = variance of error (i.e., \(\epsilon\)) associated with observed variables (i.e., \(Y\))
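
To make these symbols concrete, here is a minimal simulation sketch of the single-factor measurement model above (the sample size, loadings, and variances are illustrative assumptions, not values from the course):

set.seed(1)

n      <- 500                  # number of respondents
psi    <- 1.0                  # variance of the latent variable (eta)
lambda <- c(0.8, 0.7, 0.6)     # factor loadings linking three observed items to eta
theta  <- c(0.36, 0.51, 0.64)  # error variances for the three observed items

eta <- rnorm(n, mean = 0, sd = sqrt(psi))     # latent variable scores
Y   <- sapply(seq_along(lambda), function(j) {
  lambda[j] * eta + rnorm(n, mean = 0, sd = sqrt(theta[j]))   # Y_j = lambda_j * eta + epsilon_j
})

round(cov(Y), 2)   # approximates lambda %*% t(lambda) * psi + diag(theta)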

Developing Psychometric Scales

There is a lot of guidance in the literature regarding “best practices” for developing psychometric scales

In general, this process unfolds in three major phases, each with several steps:

Phase #1: Developing a Measurement Theory
- Step 1: Define the Construct and its Content Domain
- Step 2: Specification of Measurement Properties

Phase #2: Developing the Measure
- Step 1: Item Generation
- Step 2: Item Wording
- Step 3: Number of Items
- Step 4: Item Scaling
- Step 5: Content Validity Assessment

Phase #3: Testing the Measure
- Step 1: Selecting a Sample
- Step 2: Preliminary Factor Analysis
- Step 3: Confirmatory Factor Analysis
- Step 4: Internal Consistency Assessment
- Step 5: Convergent and Discriminant Validity
- Step 6: Criterion-Related Validity
- Step 7: Replication

Phase #1: Developing a Measurement Theory

Step 1: Define the Construct and its Content Domain

The first, and most important, step in developing a new measure is to clearly define the construct that you intend it to reflect

The “domain sampling” or “content-oriented” approach to test construction starts by situating the construct of interest within theory by defining it in terms of its content domain

A construct’s content domain refers to the set of interrelated attributes (e.g., knowledge, skills, abilities, behaviors, attitudes, values, ways of thinking, mindsets, feelings, etc.) which are included under a construct label

Step 1: Define the Construct and its Content Domain

To define a content domain, you must indicate what you intend to measure. This may emerge inductively or deductively, with the latter being grounded in/informed by theory.

Some questions to ask yourself about your construct at this stage:

  • What is the behavioral domain of the construct?
  • What does this construct represent?
  • What are behavioral examples of high and low reflections of a construct?
  • How is this construct different from or similar to other, related constructs?
  • What is the dimensionality of the construct? (n.b., a consideration of dimensionality at this stage may help build sampling plans, even if you don’t explicitly hypothesize sub-facets)

Once established, it’s good practice to review your construct and content domain definitions with subject matter experts (SMEs) and potentially revise your definitions based on that review. This is especially important if you are developing a measure about something you don’t know much about (e.g., job-specific knowledge).

Step 1: Define the Construct and its Content Domain

Consider, for example, the following construct and content domain definitions for trait conscientiousness:

Construct Definition: Conscientiousness is the personality trait of being thorough, careful, and/or vigilant; Conscientiousness implies a desire to perform tasks well

Content Domain: Conscientious people…

  • …are efficient and organized as opposed to easy-going and disorderly
  • …exhibit a tendency to show self-discipline
  • …act dutifully, and aim for achievement
  • …display planned rather than spontaneous behavior
  • …are generally organized and dependable

Conscientiousness is manifested in characteristic behaviors such as…

  • …being neat and systematic
  • …carefulness, thoroughness, and deliberation
  • …the tendency to think carefully before acting

Step 1: Define the Construct and its Content Domain

Another example…

Weaker:

Confidence in Wine Knowledge is defined as one’s self-perceived knowledge regarding wine.

Stronger:

Confidence in Wine Knowledge is defined as one’s self-perceived knowledge regarding wine topics that include wine production (i.e., how the wine is harvested and then made), wine storage (i.e., details regarding proper ways to store wine and for how long), and grape varietals (i.e., the relation between types of grapes and wine categories, how varietals influence the taste of wine). Those with high levels of Confidence in Wine Knowledge tend to have more confidence in their general wine knowledge and believe themselves to be knowledgeable in these domains, such that they believe they are wine experts. High scorers will indicate they know a lot about wine, are knowledgeable of the different wine varietals, possess confidence in their understanding of wine production, and claim knowledge regarding wine storage. Those with low scores tend to have less confidence in their wine knowledge. This definition was inspired by prior research of McClung, Freeman, & Malone (2015), who looked at confidence in wine primarily at the consumer level of purchasing/selecting wines. The Confidence in Wine Knowledge construct, while related, deals explicitly with the aforementioned confidence topics of wine production, wine storage, and wine varietals. This construct should be related to experience with wine (e.g., drinking wine, learning about wine, fine dining).

Step 1: Define the Construct and its Content Domain

Content domains can include aspects or “levels” of a construct defined from different psychological perspectives (e.g., cognitive, affective, behavioral)

For example, consider the construct of “overwork expectations,” which refers to undue demands placed on people to complete their work beyond normal business hours, and several possible ways that this idea could be assessed:

  • Cognitive level (Perceptual): “How would you describe the culture surrounding working time at this organization?”
  • Affective level (Feelings): “How do you feel about being pressured to work on the weekend?”
  • Behavioral level (Actions): “How often do you take work home with you?”

Step 2: Specification of Measurement Properties

There are several “specification decisions” that must be made initially that guide the development of the desired measure:

  • Type of Measure
  • Intended Use & Context
  • Define the Nomological Network
  • Dimensionality
  • Format & Administration
  • Item Specifications

These decisions are largely guided by a) the construct definition and b) its content domain

Step 2: Specification of Measurement Properties

Decision #1: Type of Measure

Measures of maximal performance: intended to assess an individual’s all-out effort; typically have a “known” correct answer (e.g., aptitude tests or achievement tests, including classroom exams and personnel selection tests).

  • Aptitude Test: A test that emphasizes innate potential and informal learning, and is used to predict future performance and/or behavior.

  • Achievement Test: A test that emphasizes what an individual currently knows or can do with regard to a particular subject matter.

Measures of typical performance: No single “correct” response (e.g., personality and interest inventories; attitude scales)

Step 2: Specification of Measurement Properties

Decision #2: Use & Contexts

The major concern regarding use and context at this stage is “domain specificity,” which broadly refers to whether the measure is intended to assess something that applies to all people and is relatively constant across contexts (i.e., “domain general”) or something that applies to, or manifests uniquely in, a particular situation or context (i.e., “domain specific”)

For example, we could assess trait conscientiousness at work (i.e., domain specific) versus in day-to-day life (i.e., domain general)

There are benefits of limiting the content domain to that context in which we are specifically interested, including:

  • Reducing error/noise in our measurements, thereby increasing reliability
  • Increasing face validity, which refers to the extent to which a test or assessment device appears to be valid

A couple of examples of questionable face validity:

  1. The Thematic Apperception Test (TAT) is a projective measure in which respondents provide verbal responses, in the form of narratives they make up about ambiguous pictures of people, revealing their underlying motives, concerns, and the way they see the social world.

  2. The Lockwood, Jordan, & Kunda (2002) prevention/promotion regulatory focus scale

Step 2: Specification of Measurement Properties

Decision #3: Define the Nomological Network

A nomological network refers to a “lawful pattern” of interrelationships that exists between hypothetical constructs and observable attributes that serves to guide researchers in establishing the construct validity of a psychological test or measure.

A nomological network includes a theoretical framework for what is being measured, specifying linkages between different…

  • Unobservable constructs
  • Observable attributes/measures
  • Hypothetical constructs and observable attributes/measures

Qualitatively different measurement operations (e.g., self-reports; other-reports) may be said to measure the same attributes if their locations in the nomological network link them to the same hypothetical construct variable.

Step 2: Specification of Measurement Properties

Decision #3: Define the Nomological Network

In developing a nomological network, it’s important to be able to distinguish differences between your construct and related constructs.

Thus, an essential question at this stage is, “How is this construct similar to and different from related constructs?” This leads to two related ideas:

Convergent/Discriminant Validity:

  • Convergent validity (i.e., what a measure is theoretically related to)
  • Discriminant validity (i.e., what a measure is theoretically unrelated to)

A Caution about Jingle/Jangle Fallacies:

  • Jingle Fallacy: Erroneous assumptions that two different constructs are the same because they bear the same name
  • Jangle Fallacy: Erroneous assumptions that two identical or almost identical constructs are different because they are labeled differently.

Step 2: Specification of Measurement Properties

Decision #3: Define the Nomological Network

Contrived Nomological Network for Self-Focused Constructs

Step 2: Specification of Measurement Properties

Decision #4: Dimensionality

Contrived Multidimensional Structure of Conservatism

Step 2: Specification of Measurement Properties

Decision #4: Dimensionality

Unidimensionality: Some constructs are fairly narrow in their nature, and can be represented with a single, underlying latent variable

Multidimensionality: Many constructs are fairly broad in their nature, but are themselves composed of several related components that must be measured to assess the construct fully. Such constructs must be represented with multiple underlying latent variables

If a construct is multidimensional, are all dimensions weighted equally? If not, how much weight should each dimension or facet have in the final scale?

This breaks down to a pretty fundamental question regarding your theory of measurement: Are all dimensions equally important, or are some more important than others?

There are multiple weighting strategies you could adopt; most simply, the number of test items per dimension should reflect these decisions (see the sketch below)
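
As a brief, contrived illustration of this decision (the dimension scores and weights below are assumptions for demonstration only), composites under equal and unequal weighting can be compared directly:

set.seed(2)

dims <- data.frame(                               # mean item scores on three hypothetical dimensions
  dimension_1 = rnorm(100, mean = 3.5, sd = 0.8),
  dimension_2 = rnorm(100, mean = 3.0, sd = 0.9),
  dimension_3 = rnorm(100, mean = 3.2, sd = 0.7)
)

equal_wts   <- c(1/3, 1/3, 1/3)     # all dimensions treated as equally important
unequal_wts <- c(0.50, 0.25, 0.25)  # one dimension treated as more central to the construct

total_equal   <- as.matrix(dims) %*% equal_wts
total_unequal <- as.matrix(dims) %*% unequal_wts

cor(total_equal, total_unequal)   # composites typically correlate highly, but rank orders can shift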

Step 2: Specification of Measurement Properties

Decision #5: Format & Administration

Format concerns the question, “How will responses be collected?” There are two broad options regarding format, but these are not mutually exclusive (i.e., many measures include both options).

First, free/open-ended responses (e.g., interviews, questionnaires, essay examinations, projective tests, etc.)

Second, limited/closed-ended/constructed responses (e.g., multiple choice exams, scales using Likert-type response options)

There are numerous advantages/disadvantages of each in terms of…

  • The expertise required for test administration (i.e., closed ended items minimize expertise required for test administration)

  • Analysis/scoring (i.e., it’s easier to analyze closed-ended items because there is little to no subjective judgment or coding required; responses to open-ended items often contain irrelevant information)

  • Clarification and elaboration (i.e., open-ended items allow respondents to clarify their answers; reduce the possibility of guessing or cueing).

Free response can be collected in multiple formats:

  • Written (e.g., essay or short response)
  • Oral (e.g., in-person or virtual interview)

Constructed responses likewise have multiple formats:

  • Generally for maximal performance: multiple choice, true/false, matching
  • Generally for typical performance: various scale “anchors,” such as Strongly Disagree -to- Strongly Agree; A Little Like Me -to- A Lot Like Me; Never -to- Always.

Regarding performance-based tests, there are two general classifications:

  • Speeded Tests: time-limited; often include a higher number of relatively easy questions
  • Power Tests: not time-limited (at least within reason); fewer, but usually more difficult, questions

Step 2: Specification of Measurement Properties

Decision #5: Format & Administration

Administration can be done individually, which implies one-on-one administration (e.g., WAIS/WISC; Clinical Assessments) or can be done via group administration (i.e., multiple people at one time can complete the measure; e.g., Wonderlic, SAT; online testing and surveys)

There are advantages/disadvantages of each, determined largely by time and resources:

  • Group \(\rightarrow\) greater cost saving and ease of scoring (e.g., Less time dedication from test administrator).
  • Individual \(\rightarrow\) allows one to clarify items and test taker’s responses on a one-on-one basis

Step 2: Specification of Measurement Properties

Decision #5: Format & Administration

Test length is likewise a concern for administration

There are advantages (e.g., reliability; opportunity to capture “more” of the content domain, which bolsters construct validity) and disadvantages (i.e., test length is positively associated with test-taker fatigue; negatively associated with test-taker motivation) of longer tests

At this stage, test length is a background concern. During development, the question really should be “How many items need to be initially developed?”

  • Many items will be discarded in the test development process
  • Having more items than you ultimately need is better in the early stages of development
  • Plan to develop two to three times more items initially than you plan to include in the final measure

How many items desired for the final version of the test?

  • Time constraints often preclude the possibility of a lengthy test
  • Length must be considered along with item format (e.g., open-ended take longer to complete than closed-ended)

Step 2: Specification of Measurement Properties

Decision #5: Format & Administration

Test Difficulty concerns the question, “What is the likely range of abilities of test-takers?”

The appropriate difficulty of items depends on the abilities of the population being tested (e.g., WISC vs. WAIS)

At this stage, having a clear idea of the abilities of potential test takers is important to ensure appropriate difficulty for the population.

  • Maximal performance: item difficulty can be roughly determined based on the percentage of test takers who get the item correct (see the sketch below).

  • Typical performance: readability

Items that are written at a level beyond the likely educational attainment of the targeted population of test takers will increase error variance in responses, leading to a less reliable test.
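
For the maximal-performance case mentioned above, a quick sketch of the classical difficulty index, the proportion of test takers answering each item correctly, computed on simulated scored responses (the probabilities are assumptions):

set.seed(3)

# 200 simulated test takers x 5 items; 1 = correct, 0 = incorrect
responses <- matrix(rbinom(200 * 5, size = 1, prob = c(0.9, 0.7, 0.5, 0.3, 0.2)),
                    nrow = 200, ncol = 5, byrow = TRUE,
                    dimnames = list(NULL, paste0("item_", 1:5)))

colMeans(responses)   # proportion correct per item (the classical "p" difficulty index)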

Step 2: Specification of Measurement Properties

Decision #6: Item Specifications

The process of item specification refers to developing and/or adopting a set of rules that dictate item format. Generally, you must decide how to format a) the item stem and b) the response options

For example…

All items will contain a short stem of 1-2 sentences. Stems will pose a question or statement, and response options will include potential answers to that question, or indications of how much respondents agree with the statement (depending on the form of the stem). Response options should not be longer than a few words. There will be four response options per item, and the response format will be forced-choice such that the respondent must choose only one response.

Step 2: Specification of Measurement Properties

Decision #6: Item Specifications

A quick note on terminology: Moving from responses to scores

  • The term response options refers to the choices that respondents choose from (i.e., “item responses”, “raw responses”)

  • From respondents’ choices on such options, the analyst scores the item (i.e., creates “item scores”)

Step 2: Specification of Measurement Properties

Decision #6: Item Specifications

There are numerous item formats that you can choose from, and the choice to some degree hinges on whether the measure is designed to assess typical (e.g., attitudes, values, behavior; no clear “correct” or “incorrect” answer) or maximal performance (i.e., knowledge, skills, ability; clear “correct” or “incorrect” answer)

  • Performance-Based Formats: multiple-choice, matching, true/false, short answer
  • Common Self-Report Formats: Likert, forced-choice, true/false

Phase #2: Developing the Measure

We will focus next on the process of scale development with an eye towards assuring that the ultimate measure that you develop possesses sound psychometric properties

We will focus on developing measures with multiple subscales (i.e., “multidimensional” scales; n.b., the process would be the same, although less complex, for developing a single multi-item scale).

Phase #2: Developing the Measure

Several criteria have been proposed for assessing the psychometric soundness of behavioral measures, all of which contribute to providing evidence of construct validity, which refers, generally, to the relationship between the measure and the underlying construct it is attempting to assess:

  • Content validity refers to the adequacy with which a measure assesses the domain of interest.
  • Criterion-related validity pertains to the relationship between a measure and another independent measure.
  • Internal consistency refers to the homogeneity of the items in the measure or the extent to which item responses correlate with the total test score.
  • Parsimony means that measures should be comprised of the fewest possible items that capture the domain of interest.

Each step of the process described here will contribute to increasing the confidence in the overall construct validity of the new measure.

Step 1: Item Generation

The first stage of scale development is the creation of items to assess the construct under examination. At this point, the researcher’s goal is to develop items that will result in measures that adequately sample the domain of interest to demonstrate content validity.

A strong theoretical rationale that provides the conceptual definitions to be operationalized by the scales under development is a necessary but not sufficient condition for establishing construct validity of the new measure.

Which is to say that, all else being equal, stronger theory initially should translate into stronger construct validity eventually

The most important idea here is that the scales should be evaluated and refined before they are used to collect data from a sample of your intended population.

The two greatest “expenses” in conducting any study are a) the cost of the survey administration (i.e., data collection, time of the researcher and respondents) and b) access to appropriate samples.

Thus, it is critically important that the survey measures taken into the field are psychometrically sound.

There are two basic approaches to item development.

  • The first is deductive, sometimes called logical partitioning or classification from above  
  • The second method is inductive, known also as grouping or classification from below

Step 1: Item Generation

Deductive scale development requires the use of a theoretically-based classification schema or typology prior to data collection.

This approach requires an understanding of the phenomenon to be investigated and a thorough review of the literature to develop the theoretical definition of the construct under examination.

The definition is then used as a guide for the development of items

Advantages:

  • If properly conducted, it will help assure construct validity in the final scales. Through the development of adequate construct definitions, items should accurately capture the domain of interest (i.e., content validity)

Disadvantages:

  • Very time consuming and requires that researchers possess a working knowledge of the phenomena under investigation
  • In exploratory research, it may not be appropriate to attempt to impose measures onto an unfamiliar situation (However, in most situations where theory does exist, the deductive approach would be most appropriate)

Step 1: Item Generation

Inductive scale development is so labeled because often little theory is involved at the outset, as one attempts to generate measures from individual items.

This approach requires researchers to ask a sample of respondents to provide descriptions of their feelings or to describe some aspect of their behavior.

Responses are then classified into a number of categories by content analysis based on key words or themes (Williamson et al., 1982) or a sorting process such as the Q-sorting technique with an agreement index of some type, usually using multiple judges (Kerlinger, 1986).

From these categorized responses, items are derived for subsequent analysis.

Advantages:

  • May be very useful when there is little theory to guide the researcher or when doing exploratory research.

Disadvantages:

  • Difficulty arises when attempting to develop items by interpreting the descriptions provided by respondents (e.g., you may impose your own values on such responses rather than interpreting them in terms of a theoretical framework).
  • Without a definition of the construct under examination, it can be difficult to develop items that will be conceptually consistent.
  • Requires expertise in content analysis and relies heavily on post hoc factor analytic techniques to ultimately determine scale construction (basing factor structure and, therefore, scales on item covariance rather than similar content).
  • Though items may load on the same factor, there is no guarantee that they measure the same theoretical construct or come from the same sampling domain (Nunnally, 1978)
  • Because this technique lacks a theoretical foundation, the researcher is compelled to rely on some type of intuitive framework (i.e., rather than a formal, theoretical framework), with little assurance that obtained results will not contain items that assess extraneous content domains. 
  • This technique also makes the appropriate labeling of constructs more difficult (Conway & Huffcutt, 2003)

Step 1: Item Generation

“Good” item writers should be SMEs, understand the people who will ultimately be completing the measure, have a mastery of verbal communication, and be resourceful (i.e., look for inspiration)

The task now is to generate items to target content that maps onto our construct definition and content domain

A good analogy is that this process is like sampling, hence the term “domain sampling.” The construct definition and content domain should serve as a sampling plan that helps you meet the specifications outlined previously

Your goal here is to sample items from the entire content domain to maximize domain coverage. Note, this is not the same as construct redundancy.

Step 2: Item Wording

There are a number of additional guidelines for writing items, but the common concerns when developing a new measure boil down to parsimony. This can take multiple forms.

  • Item statements should be simple and as short as possible; the language used should be familiar to target respondents. Keep in mind that respondents are likely to differ in educational level, as well as in vocabulary and language abilities.

  • Individual items must be understood by the respondent as intended by the researcher if meaningful responses are to be obtained.

  • Define ambiguous terms. Respondents are often unfamiliar with terms that may be considered commonplace to the test developer. This concern speaks once again to the importance of pilot testing both items and instructions.

  • Items that all respondents would likely answer similarly such as “This is a large department” should not be used, as they will generate little variance.

  • Ensure that response options (if provided) are logically ordered and mutually exclusive.

  • Assess choices respondents would make today, not what they plan to do in the future. For example, inquire whom an individual would vote for if the election were held today, not whom they plan on voting for in an upcoming election. While individuals are notoriously poor at predicting their own future behavior, they can report what they would do “now.”

  • Keep all items consistent in terms of perspective. Do not mix items that assess behaviors with items that assess affective responses (Harrison & McLaughlin, 1993). For example, don’t use “My advisor is hardworking” and “I respect my advisor” in the same measure.

  • Items should address only a single issue; “double-barreled” items such as “My manager is intelligent and enthusiastic” should not be used. Such items may represent two constructs (e.g., perceived competence; perceived energy) and result in confusion on the part of the respondents.

  • Leading or loaded questions should be avoided, because they may bias responses. These items implicitly communicate what the “right” answer should be. For example, “Do you support or oppose restrictions on the sale of cancer-causing tobacco products to our state’s precious youth?”

  • Avoid double negatives. Respondents required to respond on an agreement scale often experience difficulty interpreting items that include the word “not.”

  • Avoid false premises. These are items that make a statement and then ask respondents to indicate their level of agreement with a second statement. For example, “Although dogs make terrific pets, some dogs just don’t belong in urban areas.” If a respondent does not agree with the initial statement, how should they respond?

Caution About Reverse-Coded Items

  • Some might argue that the use of reverse-scored items may reduce response set bias; however, it is generally suggested that they not be used.

  • There are many examples of problems with reverse-scored items, and the use of a few of these items randomly interspersed within a measure may have a detrimental effect on psychometric properties (Harrison & McLaughlin, 1993).

Step 3: Number of Items

A very common question in scale construction is “How many items do I need for my measure?”

There are no hard and fast rules guiding this decision, but keeping a measure short is an effective means of minimizing response biases caused by boredom or fatigue (Schmitt & Stults, 1985)

Additional items also demand more time in both the development and administration of a measure (Carmines & Zeller, 1979)

Some guidance from the literature:

  • At least four items per scale are needed to test the homogeneity of items within each latent construct. Four items is also the minimum number needed to identify a factor model (e.g., Harvey et al., 1985).

  • Adequate internal consistency reliabilities can be obtained with as few as three items, and adding items indefinitely makes progressively less impact on scale reliability (Carmines & Zeller, 1979; see the Spearman-Brown sketch below)

  • It is difficult to improve on the reliabilities of five appropriate items by adding items to a scale (Hinkin, 1995; 1998).

  • It is also important to assure that the domain has been adequately sampled, because inadequate sampling is a primary source of measurement error (Churchill, 1979)

So, in general, a quality scale composed of four to six items could be developed for most constructs, though the final determination must be made by the researcher.

It should be anticipated that approximately one-half of the items created using the methods described here will be retained for use in the final scales, so at least twice as many items as will be needed in the final scales should be generated to be administered in a questionnaire.
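
As a back-of-the-envelope illustration of the diminishing returns from adding items noted above, the Spearman-Brown prophecy formula can be applied to a hypothetical scale (the starting reliability and item counts are assumptions):

# Spearman-Brown prophecy: projected reliability when a scale is lengthened by factor k
spearman_brown <- function(rel, k) (k * rel) / (1 + (k - 1) * rel)

rel_4_items <- 0.70          # assumed reliability of a hypothetical 4-item scale
k           <- (5:12) / 4    # lengthening factors corresponding to 5 through 12 items

data.frame(n_items = 5:12,
           projected_reliability = round(spearman_brown(rel_4_items, k), 3))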

Step 4: Item Scaling

Scaling refers to the process of defining the scale on which your measurements will take place.

S.S. Stevens offers perhaps the simplest and most straightforward definition of scaling:

Scaling is the assignment of objects to numbers according to a rule. Stevens (1946)

Expanding upon this a bit, Nunnally and Bernstein offer:

Measurement consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (i.e., scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification) Nunnally & Bernstein (1994, p. 3).

Step 4: Item Scaling

Levels of Measurement:

Classification vs. Scaling

By systematically measuring an attribute of interest we can classify or scale individuals with regard to the attribute of interest.

Whether we engage in classification or scaling depends in large part on the level of measurement used to assess our construct. Higher levels of measurement allow for more in-depth statistical analyses (see the sketch after this list).

  • At the nominal (i.e., qualitative) level of measurement, we only have classifications into categories, and thus scaling is not possible. There are limited data manipulations and statistical analyses we can perform on the data (e.g., we could compute frequencies and modes)

  • At the ordinal (i.e., rank order) level of measurement we can actually scale our construct (e.g., we could rank respondents based on the degree to which they possess our construct of interest; i.e., we have quantitative data and can compute the median, range, interquartile range, etc.)

  • At the interval and ratio levels, we can calculate statistics such as means, standard deviations, variances, and the various statistics of shape (e.g., skew and kurtosis).
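
A small sketch of the summaries permissible at each level (the variables are simulated purely for illustration):

set.seed(4)

nominal  <- sample(c("clinical", "community", "student"), 100, replace = TRUE)  # categories only
ordinal  <- sample(1:5, 100, replace = TRUE)                                    # rank-ordered responses
interval <- rnorm(100, mean = 50, sd = 10)                                      # equal-interval scores

table(nominal)                    # frequencies; the mode is the most frequent category
median(ordinal); IQR(ordinal)     # median and interquartile range
mean(interval); sd(interval)      # mean and standard deviation (skew/kurtosis also possible)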

Step 4: Item Scaling

Scaling Models: Standardized procedures that will allow us to attach meaningful numbers to the responses our subjects will ultimately provide.

A couple of different ways we apply different scaling models…

  • Psychological scaling – scaling of the stimuli themselves; synonymous with scale development
  • Psychometric scaling – scaling of people’s responses to the stimuli; synonymous with scale administration

Recall that, with measurement, we are typically most interested in scaling some characteristic, trait, or ability of a person. That is, we want to know how much of an attribute of interest a given person possesses.

Our goal in doing so is to estimate the degree of interindividual and intraindividual differences among the subjects on the attribute of interest.

To achieve this goal, we can scale both…

  • The stimuli (e.g., the measure) that we give to individuals (i.e., psychological scaling – scaling the stimulus)
  • The responses that individuals provide on the measure (i.e., psychometric scaling – scaling people’s responses to the stimulus)

There is a long tradition of scaling stimuli in psychophysics (e.g. “just noticeable differences…”). In the 1920s, Louis Thurstone began to apply the same scaling principles to scaling psychological attitudes.

Step 4: Item Scaling

Generally speaking, with scaling…

  • We hold one factor constant (e.g., response mode)
  • Collapse across a second (e.g., stimuli)
  • Scale the third (e.g., individuals) factor.

For example, say we administered a 25-item measure of social anxiety to a group of schoolchildren. Assume all interpret the response scale (e.g., a Likert-type scale of 1–7) for each question in the same way (i.e., the response mode is held constant). However, responses are not necessarily the same.

If all responses were the same…

  • There would be no variability
  • The scale is of little interest; no predictive value.

Next, we would collapse across stimuli (i.e., get a total score for the 25 items). As a result, we would be (psychometrically) scaling children on the construct of social anxiety.
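
A minimal sketch of that collapsing step, using simulated responses in place of the administered measure:

set.seed(5)

n_children <- 60
n_items    <- 25

# Simulated 1-7 responses; rows are children, columns are the 25 social anxiety items
item_responses <- matrix(sample(1:7, n_children * n_items, replace = TRUE),
                         nrow = n_children, ncol = n_items)

total_scores <- rowSums(item_responses)   # collapse across stimuli: one score per child
summary(total_scores)                     # the between-child variability is what we then interpret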

Step 4: Item Scaling

Whom do we select to participate in our scaling study?

  • Scale people (i.e., psychometrics): random sample of individuals from the population that we wish to generalize (e.g., a random sample of school-aged children)

  • Scale stimuli (i.e., psychological scaling): the sample should be purposefully selected based on expertise (i.e., not randomly); subject matter experts (SMEs; e.g., experts on the measurement of social anxiety, such as clinical psychologists).

That which we hold constant also needs to be identified. In other words, we need to address the means by which we will have subjects respond to our stimuli

Put differently – What type of response scale will we use for psychometric scaling?

Such response options may include:

  • Comparative judgments (e.g., which is more important, A or B?)
  • Absolute judgments (e.g., how hot is this object?)
  • Subjective evaluations (e.g., strongly agree to strongly disagree)

We’ll focus mostly on the latter…

We can scale stimuli at a variety of different measurement levels.

For psychological scaling, we’re going to focus on the nominal level of measurement (e.g., SMEs asked to sort the stimuli into different categories based on some dimension)

For example, we could have SMEs sort a variety of questions according to whether the items are measuring school-related social anxiety or not. We could then determine which items to remove and which to keep for further analyses.

Step 4: Item Scaling

Additional Thoughts on Scaling

It is important that the scale used generate sufficient variance among respondents for subsequent statistical analyses.

Although there are a number of different scaling techniques available, Likert-type scales are the most commonly used in survey research (Cook et al., 1981).

Coefficient alpha reliability with Likert scales has been shown to increase up to the use of five points, but then it levels off (Lissitz & Green, 1975).

Accordingly, it is suggested that the new items be scaled using five- or seven-point Likert-type scales.

If the scale is to be assessing frequency in the use of a behavior, it is very important that the researcher accurately benchmark the response range to maximize the obtained variance on a measure (Harrison & McLaughlin, 1991).

For example, if available responses range from “once” to “five or more times” on a behavior that is very frequently used, most respondents will answer at the upper end of the range, resulting in minimal variance and the probable elimination of an item that might in fact have been important but was scaled incorrectly.
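
A contrived sketch of this benchmarking problem (the response probabilities are assumptions chosen to mimic a very frequent behavior):

set.seed(6)

# Options 1 = "once" ... 5 = "five or more times"; for a frequent behavior, responses pile at the top
poorly_benchmarked <- sample(1:5, 300, replace = TRUE, prob = c(0.02, 0.03, 0.05, 0.15, 0.75))

# A better-benchmarked range for the same behavior spreads responses across the options
well_benchmarked   <- sample(1:5, 300, replace = TRUE, prob = c(0.15, 0.20, 0.25, 0.25, 0.15))

var(poorly_benchmarked)   # restricted variance
var(well_benchmarked)     # noticeably more usable variance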

Caution About Neutral Scale Options

There has been much discussion in the literature about the use of a neutral midpoint in the scale, such as “neither agree nor disagree.”

One argument is that it is not clear whether the respondent is truly neutral or is answering in this manner for other reasons (this can be addressed with instruction sets, however)

Respondents should be given the opportunity to opt out of answering a question if it does not apply to his or her situation (i.e., do not force responses).

This can be done in multiple ways:

  • Using a neutral midpoint, which will be a data point used in the subsequent analysis
  • Using a “does not apply” option, where the response would not then be included in the data

Either way, be careful to attend to such items in your analysis (e.g., assessments of item-level missingness are warranted)

Step 5: Content Validity Assessment

Once the items have been developed, they must be tested to assure that they adequately represent the content domain of interest. Because this task involves scaling the stimuli (e.g., a newly developed measure), this is an application of psychological scaling

Although procedures for assessing content validity have been widely studied for many years (e.g., Lawshe, 1975), Hinkin (1995, 1998) notes that problems with the content validity of measures continue.

Historically, content validity has been assessed using experts to sort items using a variety of indices, and then using factor analysis to aggregate items into scales.

Unfortunately, there is a lot of subjectivity in this process as the researcher has to make judgments regarding the number of factors to retain (e.g., using eigenvalues, scree plots, the “Kaiser criterion,” etc.) and about the magnitude of loadings for item retention (e.g., >= .70).

This type of judgment relies on heuristics and/or convention and it subsequently introduces a degree of uncertainty into the interpretation and meaning of the focal construct(s).

In addition, factor analytic techniques typically require larger sample sizes to achieve an adequate ratio of respondents (\(N\)) to items (\(K\)), or the \(N\)-to-\(K\) ratio.

Step 5: Content Validity Assessment

In general, this process involves having “experts” (i.e., people who are knowledgeable in the content area; subject matter experts or “SMEs”) review the item pool

A very general procedure (a simple summary of such relevance ratings is sketched after this list):

  • SMEs should be given a working definition of the construct
  • Ask them to rate the relevance of each item to the construct as you have defined it
  • Ask for comments on individual items (e.g., clarity, conciseness, alternative wordings)
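
As a simple, hypothetical summary of such relevance ratings (the data and the 4.0 retention cutoff below are invented for illustration, not a published rule):

relevance <- data.frame(
  item   = rep(paste0("item_", 1:3), each = 5),   # three items, each rated by five SMEs (1-5 relevance)
  rating = c(5, 4, 5, 4, 5,                       # item_1: consistently rated relevant
             4, 5, 4, 4, 5,                       # item_2: consistently rated relevant
             2, 3, 2, 1, 3)                       # item_3: rated largely irrelevant
)

item_means <- aggregate(rating ~ item, data = relevance, FUN = mean)
item_means$flag_for_review <- item_means$rating < 4.0   # flag low-relevance items for revision/removal
item_means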

Step 5: Content Validity Assessment

Hinkin and Tracey (1999) propose an ANOVA-based procedure to assess content adequacy (a term similar to, though somewhat distinct from, content validity, which cannot be demonstrated with a single inference) that overcomes some of the pitfalls of methods that rely on factor analysis.

A slightly-modified approach to the Hinkin and Tracey (1999) procedure is described here:

  • Start with definitions of the new construct(s) to be assessed and the items (i.e. approx. 5-10/construct) that have been generated using either the deductive or inductive methods described earlier.

  • Then, existing measures of similar yet different constructs must be obtained and their definitions clarified.

  • For each new measure being developed, at least one similar measure should be included in the analysis. The definitions of all constructs are then presented at the top of the questionnaire, followed by a randomized listing of all items.

  • Respondents then rate each of the items on the extent to which they believe that the items are consistent with each of the construct definitions. Response choices should range from 1 (not at all) to 5 (completely).

  • Items should be presented to respondents in a random order to control for response bias that may occur from order effects.

  • This results in a dataset wherein each rater provides a rating of each item in terms of the extent to which they believe that it is consistent with each of the construct definitions.

  • Then, a comparison of means is conducted for each item across the definitions to identify those items that are evaluated appropriately (i.e., to identify items that were statistically significantly higher on the appropriate definition).

  • Given the inherently nested structure of these data, mixed-effects modeling can be used to compare item means; those items that are rated significantly higher on the appropriate dimensions should be retained for subsequent analyses.

  • At this point the researcher can be fairly confident that the new measures adequately represent the construct or constructs under examination.

Step 5: Content Validity Assessment

To illustrate this, let’s consider the contrived data collected from a fictitious psychological scaling study conducted among \(n\) = 40 SMEs, who rated 10 items total, five of which belong to “construct #1” and five of which belong to “construct #2.”

Let’s assume the goal of this study is to demonstrate the content adequacy of items developed for a new measure of “construct #1” against a related construct, “construct #2”

To get a sense of the data structure, let’s look at just the first rater’s data:

library(dplyr)   # loaded for the %>% pipe used below

# Read the SME ratings and display the first rater's 20 rows
sme_ratings <- readxl::read_excel("Data/sme_ratings.xlsx")

sme_ratings[1:20, ] %>%
  knitr::kable() %>%
  kableExtra::kable_styling(full_width = F)
Rater Construct Item Rated_Construct Rating
1 Construct #1 Item #1 Rated Construct #1 4
1 Construct #1 Item #2 Rated Construct #1 5
1 Construct #1 Item #3 Rated Construct #1 5
1 Construct #1 Item #4 Rated Construct #1 4
1 Construct #1 Item #5 Rated Construct #1 5
1 Construct #2 Item #1 Rated Construct #1 1
1 Construct #2 Item #2 Rated Construct #1 1
1 Construct #2 Item #3 Rated Construct #1 1
1 Construct #2 Item #4 Rated Construct #1 2
1 Construct #2 Item #5 Rated Construct #1 1
1 Construct #1 Item #1 Rated Construct #2 2
1 Construct #1 Item #2 Rated Construct #2 2
1 Construct #1 Item #3 Rated Construct #2 2
1 Construct #1 Item #4 Rated Construct #2 2
1 Construct #1 Item #5 Rated Construct #2 2
1 Construct #2 Item #1 Rated Construct #2 4
1 Construct #2 Item #2 Rated Construct #2 4
1 Construct #2 Item #3 Rated Construct #2 5
1 Construct #2 Item #4 Rated Construct #2 4
1 Construct #2 Item #5 Rated Construct #2 5

Step 5: Content Validity Assessment

Then, we can run a mixed-effects model regressing Rating onto Construct, Item, Rated_Construct and their interactions, with a random effect for each rater to account for the nesting of ratings within raters.

We can summarize the model terms as a Type III ANOVA (n.b., this is akin to a “repeated measures” ANOVA model)

Note the statistically-significant Construct by Rated_Construct interaction:

# Mixed-effects model with a random intercept per rater (ratings nested within raters)
sme_model <- lme4::lmer(Rating ~ Construct * Item * Rated_Construct + (1 | Rater),
                        data = sme_ratings)

# Type III Wald chi-square tests for each model term
sme_model %>%
  car::Anova(type = "III")
Analysis of Deviance Table (Type III Wald chisquare tests)

Response: Rating
                                   Chisq Df Pr(>Chisq)    
(Intercept)                    3157.8738  1    < 2e-16 ***
Construct                       670.5648  1    < 2e-16 ***
Item                              9.6877  4    0.04603 *  
Rated_Construct                 693.8870  1    < 2e-16 ***
Construct:Item                    7.1362  4    0.12886    
Construct:Rated_Construct      1387.7741  1    < 2e-16 ***
Item:Rated_Construct              2.8505  4    0.58315    
Construct:Item:Rated_Construct    1.7243  4    0.78631    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 5: Content Validity Assessment

Next, we can specify the estimated marginal means from this model, summarizing the mean item ratings within each rated construct. Given the interaction noted above, we expect differences by construct and rated construct, but not by item.

“Higher” scores indicate that any given item is a “better” reflection of the underlying definition of a given construct:

# Estimated marginal mean rating for each Construct x Item x Rated_Construct cell
emmeans_sme_model <- emmeans::emmeans(sme_model,
                                      specs = c("Construct", "Item", "Rated_Construct")) %>%
  data.frame()

emmeans_sme_model %>%
  knitr::kable() %>%
  kableExtra::kable_styling(full_width = F)
Construct Item Rated_Construct emmean SE df lower.CL upper.CL
Construct #1 Item #1 Rated Construct #1 4.450 0.0791886 780 4.294552 4.605448
Construct #2 Item #1 Rated Construct #1 1.550 0.0791886 780 1.394552 1.705448
Construct #1 Item #2 Rated Construct #1 4.600 0.0791886 780 4.444552 4.755448
Construct #2 Item #2 Rated Construct #1 1.450 0.0791886 780 1.294552 1.605448
Construct #1 Item #3 Rated Construct #1 4.625 0.0791886 780 4.469552 4.780448
Construct #2 Item #3 Rated Construct #1 1.500 0.0791886 780 1.344552 1.655448
Construct #1 Item #4 Rated Construct #1 4.325 0.0791886 780 4.169552 4.480448
Construct #2 Item #4 Rated Construct #1 1.525 0.0791886 780 1.369552 1.680448
Construct #1 Item #5 Rated Construct #1 4.450 0.0791886 780 4.294552 4.605448
Construct #2 Item #5 Rated Construct #1 1.500 0.0791886 780 1.344552 1.655448
Construct #1 Item #1 Rated Construct #2 1.500 0.0791886 780 1.344552 1.655448
Construct #2 Item #1 Rated Construct #2 4.500 0.0791886 780 4.344552 4.655448
Construct #1 Item #2 Rated Construct #2 1.475 0.0791886 780 1.319552 1.630448
Construct #2 Item #2 Rated Construct #2 4.425 0.0791886 780 4.269552 4.580448
Construct #1 Item #3 Rated Construct #2 1.675 0.0791886 780 1.519552 1.830448
Construct #2 Item #3 Rated Construct #2 4.525 0.0791886 780 4.369552 4.680448
Construct #1 Item #4 Rated Construct #2 1.450 0.0791886 780 1.294552 1.605448
Construct #2 Item #4 Rated Construct #2 4.475 0.0791886 780 4.319552 4.630448
Construct #1 Item #5 Rated Construct #2 1.525 0.0791886 780 1.369552 1.680448
Construct #2 Item #5 Rated Construct #2 4.475 0.0791886 780 4.319552 4.630448

Step 5: Content Validity Assessment

These means are hard to interpret, but a visualization makes the task easier:

library(ggplot2)   # loaded so aes(), facet_wrap(), position_dodge(), etc. resolve without a prefix

pd <- position_dodge(0.25)   # dodge items slightly so error bars don't overlap

# Estimated marginal means with 95% CIs, faceted by the construct being rated
ggplot(emmeans_sme_model, aes(x = Construct, y = emmean, colour = Item)) +
  facet_wrap(~ Rated_Construct) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = .1, position = pd) +
  geom_point(position = pd)

Phase #3: Testing the Measure

At this point the researcher can be fairly confident that their new measure(s) adequately represent the construct(s) under examination.

However, it is now necessary to administer them to samples that are representative of the target population for further testing (i.e., psychometric scaling).

Ideally, we would want several independent samples to establish initial evidence for item operations, factor structures (EFA & CFA), and evidence of convergent, discriminant, and criterion-related validity.

You can think of the next few steps as unfolding as a series of independent studies that are each purposefully conducted to demonstrate the operation of your measure(s).

Phase #3: Testing the Measure

Citations

Cook, J.D., Hepworth, S. J., Wall, T. D., & Warr, P. B. (1981). Experience of work: A compendium and review of 249 measures and their use. New York, NY: Academic Press.

Williamson, J., Karp, D., Dalphin, J. & Gray, P. (1982) Chapter 7: Intensive interviewing, The Research Craft, Little, Brown & Co.

Kerlinger, F. (1986). Foundations of behavioral research (3rd ed.). Holt, Rinehart & Winston.

Nunnally, J.C. (1978). Psychometric theory (2nd ed.). McGraw-Hill.

Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor analysis practices in organizational research. Organizational research methods, 6(2), 147-168.

Harrison, D. A., & McLaughlin, M. E. (1993). Cognitive processes in self-report responses: Tests of item context effects in work attitude measures. Journal of Applied Psychology, 78(1), 129–140. https://doi.org/10.1037/0021-9010.78.1.129

Schmitt, N.W., & Stults, D.M. (1985). Factors defined by negatively keyed items: The results of careless respondents? Applied Psychological Measurement, 9, 367–373.

Carmines, E.G., & Zeller, R.A. (1979). Reliability and validity assessment. SAGE.

Harvey, R.J., Billings, R.S.,& Nilan, K.J. (1985). Confirmatory factor analysis of the job diagnostic survey: Good news and bad news. Journal of Applied Psychology, 70, 461–468.

Hinkin, T.R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21, 967–988.

Hinkin, T.R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1, 104–121.

Churchill, G.A. (1979). A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16, 64–73.

Lissitz, R.W., & Green, S.B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10–13.

Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel psychology, 28(4), 563-575.

Hinkin, T.R., & Tracey, J.B. (1999). An analysis of variance approach to content validation. Organizational Research Methods, 2(2), 175–186.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.