This document is one of a series designed to illustrate how the R statistical computing environment can be used to conduct various types of social work research. In this report, we present an example of using R to conduct an item response theory (IRT) analysis of the Future Orientation Scale, which is included in the School Success Profile (SSP), a compendium of scales designed for use by school social workers. The data set used in this analysis was generously provided by Dr. Gary Bowen (through Dr. Natasha Bowen). The data file is composed of responses to various SSP items by sixth- through ninth-grade students in 17 schools in North Carolina (n = 5171).
For this analysis, we use the R mirt package to fit and assess a graded response model to the 12-item Future Orientation scale (FOS) (which has also been referred to as the Success Orientation scale and Optimism Scale). The items are as follows:
The response categories are Strongly disagree, Disagree, Agree, and Strongly agree.
As noted, we used the R mirt package to fit a graded response model (the recommended model for ordered polytomous response data) using a full-information maximum likelihood fitting function. In addition, we assessed model fit using an index, M2, which is specifically designed to assess the fit of item response models for ordinal data. We used the M2-based root mean square error of approximation (RMSEA) as the primary fit index. We also used the standardized root mean square residual (SRMSR) and comparative fit index (CFI) to assess adequacy of model fit.
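library(mirt)
mod1 <- mirt(scale, 1, verbose = FALSE, itemtype = 'graded', SE = TRUE)
# The argument is spelled calcNull; TRUE fits the null model needed for the TLI and CFI shown below
M2(mod1, type = "M2*", calcNull = TRUE)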
## M2 df p RMSEA RMSEA_5 RMSEA_95 SRMSR TLI
## stats 628.2452 30 0 0.06406275 0.05974717 0.06846658 0.05689246 0.9234026
## CFI
## stats 0.9452876
The obtained RMSEA = .064 (90% CI [.060, .069]) and SRMSR = .057 suggest that the data fit the model reasonably well, judged against suggested cutoff values of RMSEA <= .06 and SRMSR <= .08. The RMSEA is marginally above its cutoff, while the SRMSR is well below its cutoff. The CFI = .945 was just below the recommended .95 threshold (although it rounds to .95).
A second area of interest is how well each item fits the model. For this assessment, we use a recommended index, S-X2. The mirt implementation of S-X2 computes an RMSEA value that can be used to assess the degree of item fit. Values less than .06 are considered evidence of adequate fit.
itemfit(mod1)
## item S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
## 1 Item.1 278.451 54 0.029 0
## 2 Item.2 127.913 49 0.018 0
## 3 Item.3 1299.710 72 0.059 0
## 4 Item.4 188.488 54 0.023 0
## 5 Item.5 226.982 52 0.026 0
## 6 Item.6 116.652 49 0.017 0
## 7 Item.7 262.456 54 0.028 0
## 8 Item.8 130.213 47 0.019 0
## 9 Item.9 134.001 45 0.020 0
## 10 Item.10 264.172 49 0.030 0
## 11 Item.11 421.270 62 0.035 0
## 12 Item.12 115.542 45 0.018 0
All of the RMSEA values are less than .06, indicating that the items had adequate fit with the model, although Item 3 (RMSEA = .059) is close to the cutoff.
Once we established the adequacy of model and item fit, we computed item parameters. IRT provides two assessments of item-latent trait relationships. The IRT parameterization generates discrimination and location parameters. The factor analysis parameterization generates factor loadings and communalities.
# IRT parameters
coef(mod1, IRTpars = TRUE, simplify = TRUE)
## $items
## a b1 b2 b3
## Item.1 2.650 -2.240 -1.579 0.045
## Item.2 3.104 -2.196 -1.489 0.003
## Item.3 1.378 -2.629 -1.858 -0.164
## Item.4 2.652 -2.418 -1.660 0.145
## Item.5 2.757 -2.345 -1.388 0.300
## Item.6 2.964 -2.388 -1.751 -0.001
## Item.7 2.609 -2.296 -1.463 0.186
## Item.8 3.381 -2.312 -1.697 -0.122
## Item.9 3.883 -2.101 -1.530 -0.068
## Item.10 3.145 -2.319 -1.813 -0.486
## Item.11 2.262 -2.452 -1.845 -0.404
## Item.12 3.710 -2.195 -1.625 -0.203
##
## $means
## F1
## 0
##
## $cov
## F1
## F1 1
The estimated IRT parameters are shown above. The values of the slope parameters (a-parameters) ranged from 1.38 to 3.88. A slope parameter is a measure of how well an item differentiates respondents with different levels of the latent trait. Items with larger values, or steeper slopes, are better at differentiating levels of theta. A slope also can be interpreted as an indicator of the strength of the relationship between an item and the latent trait, with higher slope values corresponding to stronger relationships. Item 9 was the most discriminating item with a slope estimate of 3.88, while Item 3 was the least discriminating item with a slope estimate of 1.38.
Three location parameters (b-parameters) also are listed for each item. A location parameter is interpreted as the value of theta at which a respondent has a .5 probability of responding at or above the corresponding category. There are m-1 location parameters, where m is the number of response categories. The FOS has four response categories, so there are three location parameters for each item. The location patterns for our items indicate that they provide good coverage at the lower end of the theta scale. Location parameters are expressed in theta units (standard normal z-scores), so a negative value indicates that a parameter falls below the mean on the theta scale and a positive value indicates that it falls above the mean. For example, the Item 9 location parameter b2 = -1.53 is the point on theta where a respondent has a .5 probability of responding “Agree” or “Strongly agree”. Similarly, the Item 9 location parameter b3 = -0.07 is the point on theta where a respondent has a .5 probability of responding “Strongly agree”.
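The interpretation of location parameters can be checked numerically. The following is a minimal sketch in base R, assuming the logistic boundary function of the graded response model and using the Item 9 estimates from the coef() output above; the helper name boundary_prob is ours and is not part of mirt.

```r
# Boundary (cumulative) probability in the graded response model:
# P(X >= category k | theta) = 1 / (1 + exp(-a * (theta - b_k)))
# boundary_prob is a hypothetical helper written for this illustration.
boundary_prob <- function(theta, a, b) {
  plogis(a * (theta - b))
}

# Item 9 estimates copied from the coef() output above
a9 <- 3.883
b9 <- c(b1 = -2.101, b2 = -1.530, b3 = -0.068)

# When theta equals a location parameter, the boundary probability is .5
p_at_b2 <- unname(boundary_prob(theta = b9["b2"], a = a9, b = b9["b2"]))
p_at_b3 <- unname(boundary_prob(theta = b9["b3"], a = a9, b = b9["b3"]))

# All three boundary probabilities for a respondent at the trait mean (theta = 0)
p_at_mean <- boundary_prob(theta = 0, a = a9, b = b9)

print(c(p_at_b2 = p_at_b2, p_at_b3 = p_at_b3))
print(round(p_at_mean, 3))
```

Because the boundary probabilities are cumulative, they decrease from b1 to b3 for any fixed theta: responding at or above a higher category is always less likely.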
# Factor loadings
summary(mod1)
## F1 h2
## Item.1 0.841 0.708
## Item.2 0.877 0.769
## Item.3 0.629 0.396
## Item.4 0.842 0.708
## Item.5 0.851 0.724
## Item.6 0.867 0.752
## Item.7 0.838 0.702
## Item.8 0.893 0.798
## Item.9 0.916 0.839
## Item.10 0.879 0.774
## Item.11 0.799 0.639
## Item.12 0.909 0.826
##
## SS loadings: 8.634
## Proportion Var: 0.719
##
## Factor correlations:
##
## F1
## F1 1
Factor loadings can be interpreted as the strength of the relationship between an item and the latent variable (F1). The loadings range from .63 (Item 3) to .92 (Item 9) and can be interpreted as the correlation between an item and the latent trait. Communalities (h2) are squared factor loadings and are interpreted as the proportion of variance in an item accounted for by the latent trait. All of the items had a substantive relationship (loadings > .50) with the latent trait.
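The two parameterizations describe the same model, so loadings can be recovered from slopes. The sketch below uses a commonly cited conversion (dividing the logistic slope by the scaling constant 1.702 and then standardizing). We present it as an illustration rather than as mirt's internal code, although it reproduces the loadings reported above to rounding.

```r
# Slope estimates copied from the coef() output above
a <- c(2.650, 3.104, 1.378, 2.652, 2.757, 2.964,
       2.609, 3.381, 3.883, 3.145, 2.262, 3.710)
names(a) <- paste0("Item.", 1:12)

# Assumed conversion: rescale the logistic slope to the normal-ogive metric,
# then standardize to obtain a loading between 0 and 1
D <- 1.702
loading <- (a / D) / sqrt(1 + (a / D)^2)

# Communality is the squared loading
h2 <- loading^2

print(round(cbind(F1 = loading, h2 = h2), 3))
```

The conversion is monotonic, so the ordering of items by slope and by loading is the same: Item 3 is lowest and Item 9 is highest on both.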
A strength of IRT is the ability to visually examine item and scale characteristics using various plots. These plots display how each item and the total scale relate to the latent trait across trait values. This capacity is where IRT methods have an advantage over classical test methods and CFA/SEM methods.
In this example, we explore item and scale latent trait relationships using:
It often is of interest to examine the probabilities of responding to specific categories in an item’s response scale. These probabilities are graphically displayed in the category response curves (CRCs) shown below.
plot(mod1, type = 'trace', which.items = 1:12, facet_items = TRUE,
     as.table = TRUE, auto.key = list(points = FALSE, lines = TRUE, columns = 4, space = 'top', cex = .8),
     theta_lim = c(-3, 3),
     main = "")
Each curve represents the probability of endorsing a response category (P1 = “Strongly disagree”, P2 = “Disagree”, P3 = “Agree”, and P4 = “Strongly agree”) as a function of theta. For the two middle categories the curves are symmetrical and single-peaked: as theta increases, the probability of endorsing the category increases and then decreases as responses transition to the next higher category. The curve for the lowest category decreases steadily and the curve for the highest category increases steadily. The CRCs indicate that the response categories are located in the lower range of theta. This can be interpreted as follows: for most items, it does not take a high level of theta (a high success orientation) to endorse the higher response categories.
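The values behind a set of category response curves can be computed directly from an item's parameters. The sketch below assumes the graded response model, in which each category probability is the difference between adjacent boundary probabilities. The function grm_category_probs is a hypothetical helper of ours, applied here to the Item 9 estimates.

```r
# Category probabilities for one graded response item.
# Rows are theta values; columns are categories P1..Pm.
grm_category_probs <- function(theta, a, b) {
  # Boundary probabilities P(X >= k), padded with 1 (lowest) and 0 (above highest)
  cum <- cbind(1, outer(theta, b, function(t, bk) plogis(a * (t - bk))), 0)
  probs <- cum[, -ncol(cum), drop = FALSE] - cum[, -1, drop = FALSE]
  colnames(probs) <- paste0("P", seq_len(ncol(probs)))
  probs
}

theta <- seq(-3, 3, by = 0.5)

# Item 9 estimates copied from the coef() output above
crc_item9 <- grm_category_probs(theta, a = 3.883, b = c(-2.101, -1.530, -0.068))

print(round(cbind(theta = theta, crc_item9), 3))
```

At every value of theta the four probabilities sum to 1, which is why one curve must fall as another rises.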
### Item information curves

Information is a statistical concept that refers to the ability of an item to accurately estimate scores on theta. Item-level information clarifies how well each item contributes to score estimation precision, with higher levels of information leading to more accurate score estimates.
plot(mod1, type = 'infotrace', which.items = 1:12, facet_items = TRUE,
     as.table = TRUE, auto.key = list(points = FALSE, lines = TRUE, columns = 1, space = 'right', cex = .8),
     theta_lim = c(-3, 3),
     main = "")
In polytomous models, the amount of information an item contributes depends on its slope parameter: the larger the parameter, the more information the item provides. Further, the farther apart the location parameters (b1, b2, b3), the more information the item provides. Typically, an optimally informative polytomous item will have a large slope and broad category coverage (as indicated by location parameters) over theta.
Information functions are best illustrated by the item information curves displayed above. These curves show that item information is not a static quantity; rather, it is conditional on the level of theta. The relationship between slopes and information is illustrated here. Item 3 had the lowest slope and is, therefore, the least informative item. On the other hand, Item 9 had the highest slope and provides the most statistical information. Items tended to provide the most information in the -2.5 to +1 theta range. The “wavy” form of the curves reflects the fact that item information is a composite of category information; that is, each category has an information function, and these are combined to form the item information function. The dips in each curve suggest that the response category Agree is not as informative as the Strongly disagree, Disagree, and Strongly agree response categories.
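The contrast between Item 3 and Item 9 can be reproduced from their parameters. The sketch below uses the standard graded response model information formula (the sum over categories of the squared derivative of the category probability divided by that probability). It is our own base R illustration, not mirt's implementation, and grm_item_info is a hypothetical helper.

```r
# Item information for one graded response item at each value of theta
grm_item_info <- function(theta, a, b) {
  # Boundary probabilities P(X >= k), padded with 1 and 0
  cum <- cbind(1, outer(theta, b, function(t, bk) plogis(a * (t - bk))), 0)
  w <- cum * (1 - cum)          # logistic derivative term for each boundary curve
  k <- ncol(cum)
  probs  <- cum[, -k, drop = FALSE] - cum[, -1, drop = FALSE]        # category probabilities
  dprobs <- a * (w[, -k, drop = FALSE] - w[, -1, drop = FALSE])      # their derivatives
  rowSums(dprobs^2 / probs)
}

theta <- seq(-3, 3, by = 0.1)

# Estimates copied from the coef() output above
info3 <- grm_item_info(theta, a = 1.378, b = c(-2.629, -1.858, -0.164))
info9 <- grm_item_info(theta, a = 3.883, b = c(-2.101, -1.530, -0.068))

# Peak information and where it occurs on theta
print(c(item3_max = max(info3), item3_at = theta[which.max(info3)]))
print(c(item9_max = max(info9), item9_at = theta[which.max(info9)]))
```

Because the slope enters the formula squared, the difference in slopes between Item 9 and Item 3 translates into a much larger difference in peak information.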
One particularly helpful IRT capacity is that information for individual items can be summed to form a scale information function. A scale information function is a summary of how well items, overall, provide statistical information about the latent trait. Further, scale information values can be used to compute conditional standard errors which indicate how precisely scores can be estimated across different values of theta.
plot(mod1, type = 'infoSE', theta_lim = c(-3, 3),
main="")
The relationship between scale information and conditional standard errors is illustrated above. The solid blue line represents the scale information function. The overall scale provided the most information in the -2.5 to +1 theta range. The red line represents the conditional standard errors and shows how estimate precision varies across theta, with smaller values corresponding to better precision. Because conditional standard errors mathematically mirror the scale information curve, estimated score precision was best in the -2.5 to +1 theta range.
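The link between scale information and conditional standard errors can be written out explicitly. In the standard IRT formulation (stated here as background rather than taken from the mirt output), the scale information is the sum of the item information functions, and the conditional standard error is the reciprocal of its square root:

```latex
I(\theta) = \sum_{i=1}^{12} I_i(\theta), \qquad SE(\theta) = \frac{1}{\sqrt{I(\theta)}}
```

For illustration only, a hypothetical information value of 16 at some theta would correspond to a conditional standard error of 1/4 = .25, while a value of 4 would correspond to .50. This is why the standard error curve rises wherever the information curve falls.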
IRT approaches the concept of scale reliability differently from the traditional classical test theory (CTT) approach using coefficient alpha or omega. The CTT approach assumes that reliability is a single value that applies to all scale scores. For example, coefficient alpha for the FOS = .93. In IRT, by contrast, reliability is conditional: it varies across levels of theta.
plot(mod1, type = 'rxx', theta_lim = c(-3, 3),
main="" )
The concept of conditional reliability is illustrated in the plot above. This curve is mathematically related to both scale information and conditional standard errors through simple transformations. Because of this relationship, score estimates are most reliable in the -2.5 to +1 theta range.
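One such transformation, written here as a sketch, expresses conditional reliability in terms of scale information when the latent trait is scaled to unit variance (as in the $cov output above). We believe this is the form behind the 'rxx' plot, but that is an assumption on our part and should be checked against the mirt documentation:

```latex
r_{xx}(\theta) = \frac{I(\theta)}{I(\theta) + 1} = \frac{1}{1 + SE(\theta)^2}
```

Under this form, a hypothetical information value of 9 would give a conditional reliability of 9/10 = .90, and a value of 1 would give .50, so reliability is high wherever information is high.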
It also is possible to compute a single IRT reliability estimate. The marginal reliability for the FOS = .88.
marginal_rxx(mod1)
## [1] 0.8806908
As a next step, we used model parameters to generate estimates of student theta scores. These scores are referred to as person parameters in IRT (they are called factor scores in CFA). We used a latent trait scoring procedure called expected a posteriori (EAP) estimation to generate the scores. Keep in mind that the estimates are in the theta (standard normal) metric, so they are z-like scores. IRT model-based scores have favorable properties that improve on a summed score approach. First, model-based scores reflect the parameter estimates obtained from the IRT model used. Because they are weighted by item parameters, theta score estimates often show more variability than summed scores. Second, because they are given in a standard normal metric, we can use our knowledge of the standard normal distribution to make score comparisons across individuals. For example, someone with a theta score of 1.0 is one standard deviation above average, and we can expect about 84% of the sample to have lower scores and about 16% to have higher scores. Other comparisons based on standard normal characteristics are possible.
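To make the EAP idea concrete, the sketch below computes the posterior mean of theta for a single response pattern over a grid of theta values, with a standard normal prior. It is a simplified base R illustration built from the item estimates above, not the scoring routine used by mirt. The function names, the grid, the 1 to 4 response coding, and the response patterns are all our own assumptions.

```r
# Item parameter estimates copied from the coef() output above
a <- c(2.650, 3.104, 1.378, 2.652, 2.757, 2.964,
       2.609, 3.381, 3.883, 3.145, 2.262, 3.710)
b <- matrix(c(-2.240, -1.579,  0.045,
              -2.196, -1.489,  0.003,
              -2.629, -1.858, -0.164,
              -2.418, -1.660,  0.145,
              -2.345, -1.388,  0.300,
              -2.388, -1.751, -0.001,
              -2.296, -1.463,  0.186,
              -2.312, -1.697, -0.122,
              -2.101, -1.530, -0.068,
              -2.319, -1.813, -0.486,
              -2.452, -1.845, -0.404,
              -2.195, -1.625, -0.203),
            ncol = 3, byrow = TRUE)

# Graded response model category probabilities for one item (rows = theta values)
grm_category_probs <- function(theta, a, b) {
  cum <- cbind(1, outer(theta, b, function(t, bk) plogis(a * (t - bk))), 0)
  cum[, -ncol(cum), drop = FALSE] - cum[, -1, drop = FALSE]
}

# EAP estimate: posterior mean of theta over a grid, with a standard normal prior
eap_score <- function(responses, a, b, nodes = seq(-6, 6, length.out = 121)) {
  posterior <- dnorm(nodes)                                  # prior weights
  for (i in seq_along(responses)) {
    item_probs <- grm_category_probs(nodes, a[i], b[i, ])
    posterior <- posterior * item_probs[, responses[i]]      # multiply in each item's likelihood
  }
  sum(nodes * posterior) / sum(posterior)
}

# Hypothetical response patterns, coded 1 = Strongly disagree ... 4 = Strongly agree
theta_all_disagree       <- eap_score(rep(2, 12), a, b)
theta_all_agree          <- eap_score(rep(3, 12), a, b)
theta_all_strongly_agree <- eap_score(rep(4, 12), a, b)

print(c(theta_all_disagree, theta_all_agree, theta_all_strongly_agree))

# Share of a standard normal distribution below theta = 1
print(pnorm(1))
```

The prior pulls extreme patterns toward the mean, which is one reason EAP estimates are less extreme than the response pattern alone might suggest.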
Once model-based theta score estimates are computed, it often is of interest to transform those estimates into the original scale metric. A scale characteristic function provides a means of transforming estimated theta scores to expected true scores in the original scale metric. This transformation back into the original scale metric provides a more familiar frame of reference for interpreting scores. In this study, expected true scores refer to scores on the FOS scale metric (12 to 48) that are expected as a function of estimated student theta scores.
plot(mod1, type = 'score', theta_lim = c(-3, 3), main = "")
The scale characteristic function can be graphically displayed as shown above. It has a straightforward use; for any given estimated theta score we can easily find a corresponding expected true score in the summed scale score metric. For example, an estimated theta score of -1 would translate into an expected true score of 34; an estimated theta score of 0 would translate into an expected true score of 42. These true score transformations often are of interest in practical situations where scale users are not familiar with theta scores. Also, true score estimates can be used in other important statistical analyses and are often improvements over traditional summed scores.
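The transformation from theta to an expected true score can also be computed from the item parameters. The sketch below assumes items are scored 1 to 4, so that each item's expected score is 1 plus the sum of its three boundary probabilities, and sums these across the 12 items. The function expected_true_score is a hypothetical helper of ours, not a mirt function.

```r
# Item parameter estimates copied from the coef() output above
a <- c(2.650, 3.104, 1.378, 2.652, 2.757, 2.964,
       2.609, 3.381, 3.883, 3.145, 2.262, 3.710)
b <- matrix(c(-2.240, -1.579,  0.045,
              -2.196, -1.489,  0.003,
              -2.629, -1.858, -0.164,
              -2.418, -1.660,  0.145,
              -2.345, -1.388,  0.300,
              -2.388, -1.751, -0.001,
              -2.296, -1.463,  0.186,
              -2.312, -1.697, -0.122,
              -2.101, -1.530, -0.068,
              -2.319, -1.813, -0.486,
              -2.452, -1.845, -0.404,
              -2.195, -1.625, -0.203),
            ncol = 3, byrow = TRUE)

# Expected summed score at each theta, assuming categories are scored min_score, min_score + 1, ...
expected_true_score <- function(theta, a, b, min_score = 1) {
  total <- numeric(length(theta))
  for (i in seq_along(a)) {
    # Boundary probabilities P(X >= k) for item i; their sum is the expected
    # number of category steps passed
    cum <- outer(theta, b[i, ], function(t, bk) plogis(a[i] * (t - bk)))
    total <- total + min_score + rowSums(cum)
  }
  total
}

theta <- c(-3, -2, -1, 0, 1, 2, 3)
true_scores <- expected_true_score(theta, a, b)
print(round(cbind(theta = theta, expected_true_score = true_scores), 1))
```

The resulting values stay within the 12 to 48 range of the summed scale and increase with theta, and the values at theta = -1 and theta = 0 fall close to those read from the plot.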
The overall conclusion we reached about the FOS based on the IRT analysis is that the scale is a psychometrically sound measure of future orientation. What follows is a brief summary of our conclusions.