Topic #2: Developing Psychometric Scales
Wayne State University
1 September, 2025
A general model of measurement:
Where…
There is a lot of guidance in the literature regarding “best practices” for developing psychometric scales
In general, this process unfolds in three major phases, each with several steps:
Phase #1: Developing a Measurement Theory
- Step 1: Define the Construct and its Content Domain
- Step 2: Specification of Measurement Properties
Phase #2: Developing the Measure
- Step 1: Item Generation
- Step 2: Item Wording
- Step 3: Number of Items
- Step 4: Item Scaling
- Step 5: Content Validity Assessment
Phase #3: Testing the Measure
- Step 1: Selecting a Sample
- Step 2: Preliminary Factor Analysis
- Step 3: Confirmatory Factor Analysis
- Step 4: Internal Consistency Assessment
- Step 5: Convergent and Discriminant Validity
- Step 6: Criterion-Related Validity
- Step 7: Replication
The first, and most important, step in developing a new measure is to clearly define the construct that you intend it to reflect
The “domain sampling” or “content-oriented” approach to test construction starts by situating the construct of interest within theory by defining it in terms of its content domain
A construct’s content domain refers to the set of interrelated attributes (e.g., knowledge, skills, abilities, behaviors, attitudes, values, ways of thinking, mindsets, feelings, etc.) which are included under a construct label
To define a content domain, you must indicate what you intend to measure. This may emerge inductively or deductively, with the latter being grounded in/informed by theory.
Some questions to ask yourself about your construct at this stage:
Once established, it’s good practice to review your construct and content domain definitions with subject matter experts (SMEs) and potentially revise your definitions based on that review. This is especially important if you are developing a measure about something you don’t know much about (e.g., job-specific knowledge).
Consider, for example, the following construct and content domain definitions for trait conscientiousness:
Construct Definition: Conscientiousness is the personality trait of being thorough, careful, and/or vigilant; Conscientiousness implies a desire to perform tasks well
Content Domain: Conscientious people…
Conscientiousness is manifested in characteristic behaviors such as…
Another example…
Weaker:
Confidence in Wine Knowledge is defined as one’s self-perceived knowledge regarding wine.
Stronger:
Confidence in Wine Knowledge is defined as one’s self-perceived knowledge regarding wine topics that include wine production (i.e., how the wine is harvested and then made), wine storage (i.e., details regarding proper ways to store wine and for how long), and grape varietals (i.e., the relation between types of grapes and wine categories, how varietals influence the taste of wine). Those with high levels of Confidence in Wine Knowledge tend to have more confidence in their general wine knowledge and believe themselves to be knowledgeable in these domains, such that they believe they are wine experts. High scorers will indicate they know a lot about wine, are knowledgeable of the different wine varietals, possess confidence in their understanding of wine production, and claim knowledge regarding wine storage. Those with low scores tend to have less confidence in their wine knowledge. This definition was inspired by prior research of McClung, Freeman, & Malone (2015), who looked at confidence in wine primarily at the consumer level of purchasing/selecting wines. The Confidence in Wine Knowledge construct, while related, deals explicitly with the aforementioned confidence topics of wine production, wine storage, and wine varietals. This construct should be related to experience with wine (e.g., drinking wine, learning about wine, fine dining).
Content domains can include aspects or “levels” of a construct defined from different psychological perspectives (e.g., cognitive, affective, behavioral)
For example, consider the construct of “overwork expectations,” which refers to undue demands placed on people to complete their work beyond normal business hours, and several possible ways that this idea could be assessed:
There are several “specification decisions” that must be made initially that guide the development of the desired measure.
These decisions are largely guided by a) the construct definition and b) its content domain
Decision #1: Type of Measure
Measures of maximal performance: intended to assess an individual’s all-out effort; typically have a “known” correct answer (e.g., aptitude tests or achievement tests, including classroom exams and personnel selection tests).
Aptitude Test: A test that emphasizes innate potential and informal learning, and is used to predict future performance and/or behavior.
Achievement Test: A test that emphasizes what an individual currently knows or can do with regard to a particular subject matter.
Measures of typical performance: No single “correct” response (e.g., personality and interest inventories; attitude scales)
Decision #2: Use & Contexts
The major concern regarding use and context at this stage is “domain specificity,” which broadly refers to whether the measure is intended to assess something that applies to all people and is somewhat constant across contexts (i.e., “domain general”) or whether the measure is intended to assess something that is applied to/manifests uniquely in a particular situation or context (i.e., “domain specific”)
For example, we could assess trait conscientiousness at work (i.e., domain specific) versus in day-to-day life (i.e., domain general)
There are benefits of limiting the content domain to that context in which we are specifically interested, including:
A couple of examples of questionable face validity:
Decision #3: Define the Nomological Network
A nomological network refers to a “lawful pattern” of interrelationships that exists between hypothetical constructs and observable attributes that serves to guide researchers in establishing the construct validity of a psychological test or measure.
A nomological network includes a theoretical framework for what is being measured, specifying linkages between different…
Qualitatively different measurement operations (e.g., self-reports; other-reports) may be said to measure the same attributes if their locations in the nomological network link them to the same hypothetical construct variable.
Decision #3: Define the Nomological Network
In developing a nomological network, it’s important to be able to distinguish differences between your construct and related constructs.
Thus, an essential question at this stage is, “How is this construct similar to and different from related constructs?” This leads to two related ideas:
Convergent/Discriminant Validity:
A Caution about Jingle/Jangle Fallacies:
Decision #3: Define the Nomological Network
Contrived Nomological Network for Self-Focused Constructs
Decision #4: Dimensionality
Contrived Multidimensional Structure of Conservatism
Decision #4: Dimensionality
Unidimensionality: Some constructs are fairly narrow in their nature, and can be represented with a single, underlying latent variable
Multidimensionality: Many constructs are fairly broad in their nature, but are themselves composed of several related components that must be measured to assess the construct fully. Such constructs must be represented with multiple underlying latent variables
If a construct is multidimensional, are all dimensions weighted equally? If not, how much weight should each dimension or facet have in the final scale?
This breaks down to a pretty fundamental question regarding your theory of measurement: Are all dimensions equally important, or are some more important than others?
There are multiple weighting strategies you could adopt, but most simply the number of test items per dimension should reflect these decisions
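To make this concrete, here is a minimal sketch (with made-up subscale scores and purely illustrative weights, which are assumptions rather than recommendations) contrasting a unit-weighted composite with a theory-driven weighted composite:

# Made-up subscale scores for three respondents on a three-dimensional measure
subscales <- data.frame(dim1 = c(4, 3, 5), dim2 = c(2, 4, 3), dim3 = c(5, 5, 4))

# Strategy 1: unit weighting (all dimensions treated as equally important)
unit_weighted <- rowMeans(subscales)

# Strategy 2: theory-driven weights (the values here are illustrative only)
weights  <- c(dim1 = 0.5, dim2 = 0.3, dim3 = 0.2)
weighted <- as.matrix(subscales) %*% weights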
Decision #5: Format & Administration
Format concerns the question, “How will responses be collected?” There are two broad options regarding format, but these are not mutually exclusive (i.e., many measures include both options).
First, free/open-ended responses (e.g., interviews, questionnaires, essay examinations, projective tests, etc.)
Second, limited/closed-ended/selected responses (e.g., multiple-choice exams, scales using Likert-type response options)
There are numerous advantages/disadvantages of each in terms of…
The expertise required for test administration (i.e., closed-ended items minimize the expertise required to administer the test)
Analysis/scoring (i.e., it’s easier to analyze closed-ended items because there is little to no subjective judgment or coding required; responses to open-ended items often contain irrelevant information)
Clarification and elaboration (i.e., open-ended items allow respondents to clarify their answers and reduce the possibility of guessing or cueing).
Free response can be collected in multiple formats:
Selected (closed-ended) responses likewise have multiple formats:
Regarding performance-based tests, there are two general classifications:
Decision #5: Format & Administration
Administration can be done individually, which implies one-on-one administration (e.g., WAIS/WISC; Clinical Assessments) or can be done via group administration (i.e., multiple people at one time can complete the measure; e.g., Wonderlic, SAT; online testing and surveys)
There are advantages and disadvantages of each, largely determined by time and resources
Decision #5: Format & Administration
Test length is likewise a concern for administration
There are advantages (e.g., reliability; opportunity to capture “more” of the content domain, which bolsters construct validity) and disadvantages (i.e., test length is positively associated with test-taker fatigue; negatively associated with test-taker motivation) of longer tests
At this stage, test length is a background concern. During development, the question really should be “How many items need to be initially developed?”
How many items desired for the final version of the test?
Decision #5: Format & Administration
Test Difficulty concerns the question, “What is the likely range of abilities of test-takers?”
The appropriate difficulty of items depends on the ability level of the population being tested (e.g., WISC vs. WAIS)
At this stage, having a clear idea of the abilities of potential test takers is important to ensure appropriate difficulty for the population.
Maximal performance: item difficulty can be roughly determined based on the percentage of test takers who get the item correct (see the sketch below).
Typical performance: readability
Items that are written at a level beyond the likely educational attainment of the targeted population of test takers will increase error variance in responses, leading to a less reliable test.
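As a quick illustration of the difficulty index mentioned above, a minimal sketch with simulated 0/1 (incorrect/correct) responses:

# Simulated scored responses (1 = correct, 0 = incorrect) for 100 test takers
set.seed(1)
responses <- data.frame(
  item1 = rbinom(100, 1, 0.85),  # intended to be easy
  item2 = rbinom(100, 1, 0.50),  # intended to be moderate
  item3 = rbinom(100, 1, 0.20)   # intended to be hard
)

# Item difficulty ("p values"): the proportion of test takers answering each item correctly
colMeans(responses)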
Decision #6: Item Specifications
The process of item specification refers to developing and/or adopting a set of rules that dictate item format. Generally, you must decide how to format a) the item stem and b) the response options
For example…
All items will contain a short stem of 1-2 sentences. Stems will pose a question or statement, and response options will include potential answers to that question, or indications of how much respondents agree with the statement (depending on the form of the stem). Response options should not be longer than a few words. There will be four response options per item, and the response format will be forced-choice such that the respondent must choose only one response.
Decision #6: Item Specifications
A quick note on terminology: Moving from responses to scores
The term response options refers to the choices that respondents choose from (i.e., “item responses”, “raw responses”)
From respondents’ choices on such options, the analyst scores the item (i.e., creates “item scores”)
Decision #6: Item Specifications
There are numerous item formats that you can choose from, and the choice to some degree hinges on whether the measure is designed to assess typical (e.g., attitudes, values, behavior; no clear “correct” or “incorrect” answer) or maximal performance (i.e., knowledge, skills, ability; clear “correct” or “incorrect” answer)
We will focus next on the process of scale development with an eye towards assuring that the ultimate measure that you develop possesses sound psychometric properties
We will focus on developing measures with multiple subscales (i.e., “multidimensional” scales; n.b., the process would be the same, although less complex, for developing a single multi-item scale).
Several criteria have been proposed for assessing the psychometric soundness of behavioral measures, all of which contribute to providing evidence of construct validity, which refers, generally, to the relationship between the measure and the underlying construct it is attempting to assess:
Each step of the process described here will contribute to increasing the confidence in the overall construct validity of the new measure.
The first stage of scale development is the creation of items to assess the construct under examination. At this point, the researcher’s goal is to develop items that will result in measures that adequately sample the domain of interest to demonstrate content validity.
A strong theoretical rationale that provides the conceptual definitions to be operationalized by the scales under development is a necessary but not sufficient condition for establishing construct validity of the new measure.
Which is to say that, all else being equal, stronger theory initially should translate into stronger construct validity eventually
The most important idea here is that the scales should be evaluated and refined before they are used to collect data from a sample of your intended population.
The two greatest “expenses” in conducting any study are a) the cost of the survey administration (i.e., data collection, time of the researcher and respondents) and b) access to potential appropriate samples.
Thus, it is critically important that the survey measures taken into the field are psychometrically sound.
There are two basic approaches to item development.
Deductive scale development requires the use of a theoretically-based classification schema or typology prior to data collection.
This approach requires an understanding of the phenomenon to be investigated and a thorough review of the literature to develop the theoretical definition of the construct under examination.
The definition is then used as a guide for the development of items
Advantages:
Disadvantages:
Inductive scale development is so labeled because often little theory is involved at the outset, as one attempts to generate measures from individual items.
This approach requires researchers to ask a sample of respondents to provide descriptions of their feelings or to describe some aspect of their behavior.
Responses are then classified into a number of categories by content analysis based on key words or themes (Williamson et al., 1982) or a sorting process such as the Q-sorting technique with an agreement index of some type, usually using multiple judges (Kerlinger, 1986).
From these categorized responses, items are derived for subsequent analysis.
Advantages:
Disadvantages:
“Good” item writers should be SMEs, understand the people who will ultimately complete the measure, have a mastery of verbal communication, and be resourceful (i.e., look for inspiration)
The task now is to generate items to target content that maps onto our construct definition and content domain
A good analogy is that this process is like sampling, hence the term “domain sampling.” The construct definition and content domain should serve as a sampling plan that helps you meet the specifications outlined previously
Your goal here is to sample items from the entire content domain to maximize domain coverage. Note, this is not the same as construct redundancy.
There are a number of additional guidelines for writing items, but the common concerns when developing a new measure boil down to parsimony. This can take multiple forms.
Item statements should be simple and as short as possible; the language used should be familiar to target respondents. Keep in mind that respondents are likely to differ in educational level, as well as in vocabulary and language abilities.
Individual items must be understood by the respondent as intended by the researcher if meaningful responses are to be obtained.
Define ambiguous terms. Respondents are often unfamiliar with terms that may be considered commonplace to the test developer. This concern speaks once again to the importance of pilot testing both items and instructions.
Items that all respondents would likely answer similarly such as “This is a large department” should not be used, as they will generate little variance.
Ensure that response options (if provided) are logically ordered and mutually exclusive.
Assess choices respondents would make today, not what they plan to do in the future. For example, inquire whom an individual would vote for if the election were held today, not whom they plan on voting for in an upcoming election. While individuals are notoriously poor at predicting their own future behavior, they can report what they would do “now.”
Keep all items consistent in terms of perspective. Do not mix items that assess behaviors with items that assess affective responses (Harrison & McLaughlin, 1993). For example, don’t use “My advisor is hardworking” and “I respect my advisor” in the same measure.
Items should address only a single issue; “double-barreled” items such as “My manager is intelligent and enthusiastic” should not be used. Such items may represent two constructs (e.g., perceived competence; perceived energy) and result in confusion on the part of the respondents.
Leading or loaded questions should be avoided, because they may bias responses. These items implicitly communicate what the “right” answer should be. For example, “Do you support or oppose restrictions on the sale of cancer-causing tobacco products to our state’s precious youth?”
Avoid double negatives. Respondents required to respond on an agreement scale often experience difficulty interpreting items that include the word “not.”
Avoid false premises. These are items that make a statement and then ask respondents to indicate their level of agreement with a second statement. For example, “Although dogs make terrific pets, some dogs just don’t belong in urban areas.” If a respondent does not agree with the initial statement, how should they respond?
Caution About Reverse-Coded Items
Some might argue that the use of reverse-scored items may reduce response set bias; however, it is generally suggested that they not be used.
There are many examples of problems with reverse-scored items, and the use of a few of these items randomly interspersed within a measure may have a detrimental effect on psychometric properties (Harrison & McLaughlin, 1993).
A very common question in scale construction is “How many items do I need for my measure?”
There are no hard and fast rules guiding this decision, but keeping a measure short is an effective means of minimizing response biases caused by boredom or fatigue (Schmitt & Stults, 1985)
Additional items also demand more time in both the development and administration of a measure (Carmines & Zeller, 1979)
Some guidance from the literature:
At least four items per scale are needed to test the homogeneity of items within each latent construct. Four items is also the minimum number needed to identify a factor model (e.g., Harvey et al., 1985).
Adequate internal consistency reliabilities can be obtained with as few as three items, and adding items indefinitely makes progressively less impact on scale reliability (Carmines & Zeller, 1979)
It is difficult to improve on the reliabilities of five appropriate items by adding items to a scale (Hinkin, 1995, 1998; see the sketch below).
It is also important to assure that the domain has been adequately sampled, because inadequate sampling is a primary source of measurement error (Churchill, 1979)
So, in general, a quality scale composed of four to six items could be developed for most constructs, though the final determination must be made by the researcher.
It should be anticipated that approximately one-half of the items created using the methods described here will be retained for use in the final scales, so at least twice as many items as will be needed in the final scales should be generated to be administered in a questionnaire.
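To illustrate the diminishing-returns point above, here is a minimal sketch using the Spearman-Brown prophecy formula (the assumed average inter-item correlation of .30 is purely illustrative):

# Projected reliability of a k-item scale under the Spearman-Brown prophecy formula
spearman_brown <- function(k, r_bar) {
  (k * r_bar) / (1 + (k - 1) * r_bar)
}

k_items <- 1:15
round(spearman_brown(k_items, r_bar = 0.30), 2)
# Incremental gains in reliability shrink rapidly after roughly four to six items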
Scaling refers to the process of defining the scale on which your measurements will take place.
S.S. Stevens offers perhaps the simplest and most straightforward definition of scaling:
Scaling is the assignment of objects to numbers according to a rule. Stevens (1946)
Expanding upon this a bit, Nunnally and Bernstein offer:
Measurement consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (i.e., scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification) Nunnally & Bernstein (1994, p. 3).
Levels of Measurement:
Classification vs. Scaling
By systematically measuring an attribute of interest we can classify or scale individuals with regard to the attribute of interest.
Whether we engage in classification or scaling depends in large part on the level of measurement used to assess our construct. Higher levels of measurement allow for more in-depth statistical analyses.
At the nominal (i.e., qualitative) level of measurement, we only have classifications into categories, and thus scaling is not possible. There are limited data manipulations and statistical analyses we can perform on the data (e.g., we could compute frequencies and modes)
At the ordinal (i.e., rank order) level of measurement, we can actually scale our construct (e.g., we could rank respondents based on the degree to which they possess our construct of interest; i.e., we have quantitative data and can compute the median, range, interquartile range, etc.)
At the interval and ratio levels, we can calculate statistics such as means, standard deviations, variances, and the various statistics of shape (e.g., skew and kurtosis).
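To make the list above concrete, a minimal sketch (with made-up values) of the statistics available at each level:

nominal  <- factor(c("red", "blue", "blue", "green", "red"))  # categories only
ordinal  <- c(1, 3, 2, 5, 4)                                  # rank-order data
interval <- c(12.5, 14.1, 13.3, 15.0, 12.9)                   # interval/ratio-level scores

table(nominal)                 # nominal: frequencies (and, by extension, the mode)
median(ordinal); IQR(ordinal)  # ordinal: median and interquartile range
mean(interval); sd(interval)   # interval/ratio: mean, standard deviation, variance, etc.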
Scaling Models: Standardized procedures that will allow us to attach meaningful numbers to the responses our subjects will ultimately provide.
A couple of different ways we apply different scaling models…
Recall that, with measurement, we are typically most interested in scaling some characteristic, trait, or ability of a person. That is, we want to know how much of an attribute of interest a given person possesses.
Our goal in doing so is to estimate the degree of interindividual and intraindividual differences among the subjects on the attribute of interest.
To achieve this goal, we can scale both…
There is a long tradition of scaling stimuli in psychophysics (e.g. “just noticeable differences…”). In the 1920s, Louis Thurstone began to apply the same scaling principles to scaling psychological attitudes.
Generally speaking, with scaling…
For example, say we administered a 25-item measure of social anxiety to a group of schoolchildren. Assume all interpret the response scale (e.g., a Likert-type scale of 1–7) for each question in the same way (i.e., the response mode is held constant). However, responses are not necessarily the same.
If all responses were the same…
Next, we would collapse across stimuli (i.e., get a total score for the 25 items). As a result, we would be (psychometrically) scaling children on the construct of social anxiety.
Whom do we select to participate in our scaling study?
Scale people (i.e., psychometrics): a random sample of individuals from the population to which we wish to generalize (e.g., a random sample of school-aged children)
Scale stimuli (i.e., psychological scaling): the sample should be purposefully selected based on expertise (i.e., not randomly); that is, subject matter experts (SMEs; e.g., experts on the measurement of social anxiety, such as clinical psychologists).
That which we hold constant also needs to be identified. In other words, we need to address the means by which we will have subjects respond to our stimuli
Put differently – What type of response scale will we use for psychometric scaling?
Such response options may include:
We’ll focus mostly on the latter…
We can scale stimuli at a variety of different measurement levels.
For psychological scaling, we’re going to focus on the nominal level of measurement (e.g., SMEs asked to sort the stimuli into different categories based on some dimension)
For example, we could have SMEs sort a variety of questions according to whether the items are measuring school-related social anxiety or not. We could then determine which items to remove and which to keep for further analyses.
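A minimal sketch of how such a sort might be summarized (the data, the “social anxiety” label, and the 80% retention cutoff are all illustrative assumptions):

# Hypothetical SME-by-item matrix of category assignments from a sorting task
set.seed(2)
sorts <- matrix(sample(c("social anxiety", "other"), 5 * 10, replace = TRUE, prob = c(0.8, 0.2)),
                nrow = 5, ncol = 10,
                dimnames = list(paste0("SME", 1:5), paste0("item", 1:10)))

# Per-item agreement: proportion of SMEs placing the item in its intended category
agreement <- colMeans(sorts == "social anxiety")
names(agreement[agreement >= 0.8])  # items retained under an 80% agreement cutoff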
Additional Thoughts on Scaling
It is important that the scale used generate sufficient variance among respondents for subsequent statistical analyses.
Although there are a number of different scaling techniques available, Likert-type scales are the most commonly used in survey research (Cook et al., 1981).
Coefficient alpha reliability with Likert scales has been shown to increase up to the use of five points, but then it levels off (Lissitz & Green, 1975).
Accordingly, it is suggested that the new items be scaled using five- or seven-point Likert-type scales.
If the scale is to be assessing frequency in the use of a behavior, it is very important that the researcher accurately benchmark the response range to maximize the obtained variance on a measure (Harrison & McLaughlin, 1991).
For example, if available responses range from “once” to “five or more times” on a behavior that is very frequently used, most respondents will answer at the upper end of the range, resulting in minimal variance and the probable elimination of an item that might in fact have been important but was scaled incorrectly.
Caution About Neutral Scale Options
There has been much discussion in the literature about the use of a neutral midpoint in the scale, such as “neither agree nor disagree.”
One argument is that it is not clear whether the respondent is truly neutral or is answering in this manner for other reasons (this can be addressed with instruction sets, however)
Respondents should be given the opportunity to opt out of answering a question if it does not apply to his or her situation (i.e., do not force responses).
This can be done in multiple ways:
Either way, be careful to attend to such items in your analysis (e.g., assessments of item-level missingness are warranted)
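For example, a quick item-level missingness check (a sketch assuming a respondent-by-item data frame, here hypothetically named survey_df, in which skipped or “not applicable” responses are recorded as NA):

# Proportion of missing responses per item; items that many respondents skip may not apply broadly
item_missingness <- colMeans(is.na(survey_df))
sort(item_missingness, decreasing = TRUE)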
Once the items have been developed, they must be tested to assure that they adequately represent the content domain of interest. Because this task involves scaling the stimuli (e.g., a newly developed measure), this is an application of psychological scaling
Although procedures for assessing content validity have been widely studied for many years (e.g., Lawshe, 1975), Hinkin (1995, 1998) notes that problems with the content validity of measures persist.
Historically, content validity has been assessed using experts to sort items using a variety of indices, and then using factor analysis to aggregate items into scales.
Unfortunately, there is a lot of subjectivity in this process as the researcher has to make judgments regarding the number of factors to retain (e.g., using eigenvalues, scree plots, the “Kaiser criterion,” etc.) and about the magnitude of loadings for item retention (e.g., >= .70).
This type of judgment relies on heuristics and/or convention and it subsequently introduces a degree of uncertainty into the interpretation and meaning of the focal construct(s).
In addition, factor analytic techniques typically require larger sample sizes to achieve an adequate ratio of respondents (\(N\)) to items (\(K\)), or the \(N\)-to-\(K\) ratio.
In general, this process involves having “experts” (i.e., people who are knowledgeable in the content area; subject matter experts or “SMEs”) review the item pool
A very general procedure:
Hinkin and Tracey (1999) propose an ANOVA-based procedure to assess content adequacy (a term similar to, though somewhat distinct from, content validity, which cannot be demonstrated with a single inference) that overcomes some of the pitfalls of methods that rely on factor analysis.
A slightly-modified approach to the Hinkin and Tracey (1999) procedure is described here:
Start with definitions of the new construct(s) to be assessed and the items (i.e., approximately 5–10 per construct) that have been generated using either the deductive or inductive methods described earlier.
Then, existing measures of similar yet different constructs must be obtained and their definitions clarified.
For each new measure being developed, at least one similar measure should be included in the analysis. The definitions of all constructs are then presented at the top of the questionnaire, followed by a randomized listing of all items.
Respondents then rate each of the items on the extent to which they believe that the items are consistent with each of the construct definitions. Response choices should range from 1 (not at all) to 5 (completely).
Items should be presented to respondents in a random order to control for response bias that may occur from order effects.
This results in a dataset wherein each rater provides a rating of each item in terms of the extent to which they believe that it is consistent with each of the construct definitions.
Then, a comparison of means is conducted for each item across the definitions to identify those items that are evaluated appropriately (i.e., to identify items that were statistically significantly higher on the appropriate definition).
Given the inherently nested structure of these data, mixed-effects modeling can be used to compare item means; those items that are rated significantly higher on the appropriate dimensions should be retained for subsequent analyses.
At this point the researcher can be fairly confident that the new measures adequately represent the construct or constructs under examination.
To illustrate this, let’s consider the contrived data collected from a fictitious psychological scaling study conducted among \(n\) = 40 SMEs, who rated 10 items total, five of which belong to “construct #1” and five of which belong to “construct #2.”
Let’s assume the goal of this study is to demonstrate the content adequacy of items developed for a new measure of “construct #1” against a related construct, “construct #2”
To get a sense of the data structure, let’s look at just the first rater’s data:
# Read the SME ratings and display the first rater's 20 rows (10 items x 2 rated constructs)
sme_ratings <- readxl::read_excel("Data/sme_ratings.xlsx")

sme_ratings[1:20,] %>%
  knitr::kable() %>%
  kableExtra::kable_styling(full_width = F)

| Rater | Construct | Item | Rated_Construct | Rating |
|---|---|---|---|---|
| 1 | Construct #1 | Item #1 | Rated Construct #1 | 4 |
| 1 | Construct #1 | Item #2 | Rated Construct #1 | 5 |
| 1 | Construct #1 | Item #3 | Rated Construct #1 | 5 |
| 1 | Construct #1 | Item #4 | Rated Construct #1 | 4 |
| 1 | Construct #1 | Item #5 | Rated Construct #1 | 5 |
| 1 | Construct #2 | Item #1 | Rated Construct #1 | 1 |
| 1 | Construct #2 | Item #2 | Rated Construct #1 | 1 |
| 1 | Construct #2 | Item #3 | Rated Construct #1 | 1 |
| 1 | Construct #2 | Item #4 | Rated Construct #1 | 2 |
| 1 | Construct #2 | Item #5 | Rated Construct #1 | 1 |
| 1 | Construct #1 | Item #1 | Rated Construct #2 | 2 |
| 1 | Construct #1 | Item #2 | Rated Construct #2 | 2 |
| 1 | Construct #1 | Item #3 | Rated Construct #2 | 2 |
| 1 | Construct #1 | Item #4 | Rated Construct #2 | 2 |
| 1 | Construct #1 | Item #5 | Rated Construct #2 | 2 |
| 1 | Construct #2 | Item #1 | Rated Construct #2 | 4 |
| 1 | Construct #2 | Item #2 | Rated Construct #2 | 4 |
| 1 | Construct #2 | Item #3 | Rated Construct #2 | 5 |
| 1 | Construct #2 | Item #4 | Rated Construct #2 | 4 |
| 1 | Construct #2 | Item #5 | Rated Construct #2 | 5 |
Then, we can run a mixed-effects model regressing Rating onto Construct, Item, Rated_Construct and their interactions, with a random effect for each rater to account for the nesting of ratings within raters.
We can summarize the model terms as a Type III ANOVA (n.b., this is akin to a “repeated measures” ANOVA model)
Note the statistically-significant Construct by Rated_Construct interaction:
# Mixed-effects model: fixed effects for Construct, Item, Rated_Construct, and their interactions,
# with a random intercept per rater to account for the nesting of ratings within raters
sme_model <- lme4::lmer(Rating ~ Construct * Item * Rated_Construct + (1 | Rater),
                        data = sme_ratings)

sme_model %>%
  car::Anova(type = "III")

Analysis of Deviance Table (Type III Wald chisquare tests)
Response: Rating
Chisq Df Pr(>Chisq)
(Intercept) 3157.8738 1 < 2e-16 ***
Construct 670.5648 1 < 2e-16 ***
Item 9.6877 4 0.04603 *
Rated_Construct 693.8870 1 < 2e-16 ***
Construct:Item 7.1362 4 0.12886
Construct:Rated_Construct 1387.7741 1 < 2e-16 ***
Item:Rated_Construct 2.8505 4 0.58315
Construct:Item:Rated_Construct 1.7243 4 0.78631
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, we can compute the estimated marginal means from this model, summarizing the mean item ratings within each rated construct. Given the interaction noted above, we expect differences by construct and rated construct, but not by item.
“Higher” scores indicate that any given item is a “better” reflection of the underlying definition of a given construct:
# Estimated marginal means for each Construct x Item x Rated_Construct cell
emmeans_sme_model <- emmeans::emmeans(sme_model,
                                      specs = c("Construct", "Item", "Rated_Construct")) %>%
  data.frame()

emmeans_sme_model %>%
  knitr::kable() %>%
  kableExtra::kable_styling(full_width = F)

| Construct | Item | Rated_Construct | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|---|---|
| Construct #1 | Item #1 | Rated Construct #1 | 4.450 | 0.0791886 | 780 | 4.294552 | 4.605448 |
| Construct #2 | Item #1 | Rated Construct #1 | 1.550 | 0.0791886 | 780 | 1.394552 | 1.705448 |
| Construct #1 | Item #2 | Rated Construct #1 | 4.600 | 0.0791886 | 780 | 4.444552 | 4.755448 |
| Construct #2 | Item #2 | Rated Construct #1 | 1.450 | 0.0791886 | 780 | 1.294552 | 1.605448 |
| Construct #1 | Item #3 | Rated Construct #1 | 4.625 | 0.0791886 | 780 | 4.469552 | 4.780448 |
| Construct #2 | Item #3 | Rated Construct #1 | 1.500 | 0.0791886 | 780 | 1.344552 | 1.655448 |
| Construct #1 | Item #4 | Rated Construct #1 | 4.325 | 0.0791886 | 780 | 4.169552 | 4.480448 |
| Construct #2 | Item #4 | Rated Construct #1 | 1.525 | 0.0791886 | 780 | 1.369552 | 1.680448 |
| Construct #1 | Item #5 | Rated Construct #1 | 4.450 | 0.0791886 | 780 | 4.294552 | 4.605448 |
| Construct #2 | Item #5 | Rated Construct #1 | 1.500 | 0.0791886 | 780 | 1.344552 | 1.655448 |
| Construct #1 | Item #1 | Rated Construct #2 | 1.500 | 0.0791886 | 780 | 1.344552 | 1.655448 |
| Construct #2 | Item #1 | Rated Construct #2 | 4.500 | 0.0791886 | 780 | 4.344552 | 4.655448 |
| Construct #1 | Item #2 | Rated Construct #2 | 1.475 | 0.0791886 | 780 | 1.319552 | 1.630448 |
| Construct #2 | Item #2 | Rated Construct #2 | 4.425 | 0.0791886 | 780 | 4.269552 | 4.580448 |
| Construct #1 | Item #3 | Rated Construct #2 | 1.675 | 0.0791886 | 780 | 1.519552 | 1.830448 |
| Construct #2 | Item #3 | Rated Construct #2 | 4.525 | 0.0791886 | 780 | 4.369552 | 4.680448 |
| Construct #1 | Item #4 | Rated Construct #2 | 1.450 | 0.0791886 | 780 | 1.294552 | 1.605448 |
| Construct #2 | Item #4 | Rated Construct #2 | 4.475 | 0.0791886 | 780 | 4.319552 | 4.630448 |
| Construct #1 | Item #5 | Rated Construct #2 | 1.525 | 0.0791886 | 780 | 1.369552 | 1.680448 |
| Construct #2 | Item #5 | Rated Construct #2 | 4.475 | 0.0791886 | 780 | 4.319552 | 4.630448 |
These means are hard to interpret, but a visualization makes the task easier:
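One possibility (a sketch, assuming ggplot2 is available and using the emmeans_sme_model data frame created above) is to plot each item's estimated mean rating, with its confidence interval, against each rated construct:

library(ggplot2)

ggplot(emmeans_sme_model,
       aes(x = Item, y = emmean, color = Rated_Construct, group = Rated_Construct)) +
  geom_point(position = position_dodge(width = 0.3)) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL),
                width = 0.2, position = position_dodge(width = 0.3)) +
  facet_wrap(~ Construct) +
  labs(y = "Estimated marginal mean rating (1 = not at all, 5 = completely)",
       color = "Rated against") +
  theme_minimal()

The per-item comparisons described earlier (i.e., whether each item is rated significantly higher on its intended construct definition than on the comparison definition) can also be obtained from the same model; a sketch using emmeans contrasts:

# Within each Construct-by-Item cell, compare the mean rating under the two definitions
item_contrasts <- emmeans::emmeans(sme_model,
                                   specs = "Rated_Construct",
                                   by = c("Construct", "Item")) %>%
  emmeans::contrast(method = "pairwise")

summary(item_contrasts, infer = TRUE)  # retain items with significant differences in the expected direction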
At this point the researcher can be fairly confident that their new measure(s) adequately represent the construct(s) under examination.
However, it is now necessary to administer them to samples that are representative of the target population for further testing (i.e., psychometric scaling).
Ideally, we would want several independent samples to establish initial evidence for item operations, factor structures (EFA & CFA), and evidence of convergent, discriminant, and criterion-related validity.
You can think of the next few steps as unfolding as a series of independent studies that are each purposefully conducted to demonstrate the operation of your measure(s).
Cook, J. D., Hepworth, S. J., Wall, T. D., & Warr, P. B. (1981). The experience of work: A compendium and review of 249 measures and their use. Academic Press.
Williamson, J., Karp, D., Dalphin, J., & Gray, P. (1982). Intensive interviewing (Chapter 7). In The research craft. Little, Brown & Co.
Kerlinger, F. (1986). Foundations of behavioral research (3rd ed.). Holt, Rinehart & Winston.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor analysis practices in organizational research. Organizational Research Methods, 6(2), 147–168.
Harrison, D. A., & McLaughlin, M. E. (1993). Cognitive processes in self-report responses: Tests of item context effects in work attitude measures. Journal of Applied Psychology, 78(1), 129–140. https://doi.org/10.1037/0021-9010.78.1.129
Schmitt, N.W., & Stults, D.M. (1985). Factors defined by negatively keyed items: The results of careless respondents? Applied Psychological Measurement, 9, 367–373.
Carmines, E.G., & Zeller, R.A. (1979). Reliability and validity assessment. SAGE.
Harvey, R.J., Billings, R.S.,& Nilan, K.J. (1985). Confirmatory factor analysis of the job diagnostic survey: Good news and bad news. Journal of Applied Psychology, 70, 461–468.
Hinkin, T.R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21, 967–988.
Hinkin, T.R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1, 104–121.
Churchill, G.A. (1979). A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16, 64–73.
Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60, 10–13.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563–575.
Hinkin, T. R., & Tracey, J. B. (1999). An analysis of variance approach to content validation. Organizational Research Methods, 2(2), 175–186.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.