A full survey project usually consists of six phases.

  1. Item Generation. Start by generating a list of candidate survey items. With help from SMEs, you evaluate the equivalence (interrater reliability) and content validity of the candidate survey items and pare down the list into the final survey.

  2. Survey Administration. Administer the survey to respondents and perform an exploratory data analysis. Summarize the Likert items with plots and look for correlations among the variables.

  3. Item Reduction. Explore the dimensions of the latent variable in the survey data with parallel analysis and exploratory factor analysis. Assess the internal consistency of the items with Cronbach’s alpha and split-half tests, and remove items that do not add value and/or amend your theory of the number of dimensions.

  4. Confirmatory Factor Analysis. Perform a formal hypothesis test of the theory that emerged from the exploratory factor analysis.

  5. Convergent/Discriminant Validity. Test for convergent and discriminant construct validity.

  6. Replication. Establish test-retest reliability and criterion validity.

Background: Quality

Quality surveys should be both reliable (consistent) and valid (accurate).

Reliability

A survey’s reliability is the extent to which the results can be reproduced when repeated under similar conditions. You assess reliability three ways:

  • Equivalence (aka Interrater). Interrater reliability generally means it does not matter who does the “rating” or measuring because the assessment criteria are unambiguous. For surveys, this means each item should be unambiguous, so if a group of subject matter experts (SMEs) took the survey they would respond identically.
  • Internal consistency. Internal consistency means different parts of a test measuring the same thing will yield the same results. For surveys, this means responses to items measuring the same thing ought to be highly correlated.
  • Stability (aka Test-retest). Test-retest reliability means you can expect the same results from repeated measurements.

You use Cohen’s Kappa test of inter-rater reliability to test equivalence in the Item Generation phase (1). You use Cronbach’s alpha and split-half tests to assess internal consistency in the Item Reduction phase (3). You use the test-retest construct to assess stability in the Replication phase (6).

Validity

A survey’s validity is how accurately it measures the latent variable it was intended to measure. A high-validity survey produces results that correspond to real characteristics of the social world. Reliability is a necessary, but not sufficient, condition for validity. You assess validity three ways.

  • Content. Content validity means the measurement covers all aspects of the latent variable.
  • Construct. Construct validity means the measure is properly grounded on a theory of the latent variable.
  • Criterion. Criterion validity means the result of a measure corresponds to other valid measures of the same latent variable.

You use Lawshe’s CVR to assess content validity in the Item Generation phase (1). You use convergent analysis and discriminant analysis to assess construct validity in the Convergent/Discriminant Validity phase (5). You use concurrent analysis and predictive analysis to assess criterion validity in the Replication phase (6).

1. Item Generation

After you generate a list of candidate survey items, enlist SMEs to assess their inter-rater reliability with Cohen’s Kappa and content validity with Lawshe’s CVR.

Cohen’s Kappa

An item has interrater reliability if it produces consistent results across raters. One way to test this is by having SMEs take the survey. Their answers should be close to each other. Conduct an inter-rater reliability test by measuring the statistical significance of SME response agreement with Cohen’s kappa test statistic.

Suppose your survey measures brand loyalty and two SMEs answer 13 survey items like this (showing first 6):

head(sme)
##   RATER_A RATER_B
## 1       1       1
## 2       2       2
## 3       3       2
## 4       2       3
## 5       1       3
## 6       1       1

You could measure SME agreement with a simple correlation matrix (with cor(sme)) or by measuring the percentage of items they rate identically (with irr::agree(sme)), but these measures do not test for statistical significance.

cor(sme)
##           RATER_A   RATER_B
## RATER_A 1.0000000 0.3291403
## RATER_B 0.3291403 1.0000000
irr::agree(sme)
##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 13 
##    Raters = 2 
##   %-agree = 46.2

Instead, calculate the Cohen’s kappa test statistic. Cohen’s kappa compares the observed agreement (accuracy) to the probability of chance agreement. \(\kappa\) < 0.4 indicates poor agreement, \(\kappa\) >= 0.4 is moderate, \(\kappa\) >= 0.6 is substantial, and \(\kappa\) >= 0.8 is very strong agreement.
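
Formally, \[\kappa = \frac{p_o - p_e}{1 - p_e}\] where \(p_o\) is the observed proportion of agreement and \(p_e\) is the proportion of agreement expected by chance.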

psych::cohen.kappa(sme)
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
## 
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
##                   lower estimate upper
## unweighted kappa -0.179     0.19  0.55
## weighted kappa   -0.099     0.32  0.73
## 
##  Number of subjects = 13

Use the weighted kappa for ordinal measures like Likert items (see Wikipedia). In this example, \(\kappa\) is only 0.32 (poor agreement).

Lawshe’s CVR

An item has content validity if SMEs agree on its relevance to the latent variable. Test content validity with Lawshe’s content validity ratio (CVR),

\[CVR = \frac{E - N/2}{N/2}\] where \(N\) is the number of SMEs and \(E\) is the number who rate the item as essential. CVR can range from -1 to 1. E.g., suppose three SMEs (A, B, and C) assess the relevance of 5 survey items as “Not Necessary”, “Useful”, or “Essential”:

##   item            A            B            C
## 1    1    Essential       Useful Not necesary
## 2    2       Useful Not necesary       Useful
## 3    3 Not necesary Not necesary    Essential
## 4    4    Essential       Useful    Essential
## 5    5    Essential    Essential    Essential

Use the psychometric::CVratio() function to calculate CVR. The threshold CVR to keep or drop an item depends on the number of raters. CVR should be >= 0.99 for 5 experts; >= 0.49 for 15, and >= 0.29 for 40.

sme2 %>% 
  pivot_longer(-item, names_to = "expert", values_to = "rating") %>%
  group_by(item) %>% 
  summarize(.groups = "drop",
            n_sme = length(unique(expert)),
            n_ess = sum(rating == "Essential"),
            CVR = psychometric::CVratio(NTOTAL = n_sme, NESSENTIAL = n_ess))
## # A tibble: 5 x 4
##    item n_sme n_ess    CVR
##   <int> <int> <int>  <dbl>
## 1     1     3     1 -0.333
## 2     2     3     0 -1    
## 3     3     3     1 -0.333
## 4     4     3     2  0.333
## 5     5     3     3  1

In this example, the items vary widely in content validity, from unanimous consensus for (CVR = 1) to unanimous consensus against (CVR = -1).

2. Survey Administration

The second phase of a survey analysis is to collect the responses and perform an exploratory data analysis to familiarize yourself with the data.

Frequencies

brand_rep is a brand reputation survey of n = 559 respondents answering nine 5-point Likert-scale items. The responses come in as numeric, and you will want to leave them that way for most analyses.

brand_rep <- read_csv(url("https://assets.datacamp.com/production/repositories/4494/datasets/59b5f2d717ddd647415d8c88aa40af6f89ed24df/brandrep-cleansurvey-extraitem.csv"))

psych::response.frequencies(brand_rep)
##                        1         2          3           4          5 miss
## well_made      0.6493739 0.2665474 0.04830054 0.014311270 0.02146691    0
## consistent     0.6779964 0.2343470 0.04651163 0.019677996 0.02146691    0
## poor_workman_r 0.7584973 0.1699463 0.04472272 0.007155635 0.01967800    0
## higher_price   0.3130590 0.2593918 0.18425760 0.119856887 0.12343470    0
## lot_more       0.1484794 0.1735242 0.15563506 0.162790698 0.35957066    0
## go_up          0.1842576 0.1824687 0.19677996 0.177101968 0.25939177    0
## stands_out     0.4275492 0.3041145 0.13953488 0.067978533 0.06082290    0
## unique         0.4579606 0.2933810 0.11985689 0.069767442 0.05903399    0
## one_of_a_kind  0.6225403 0.1949911 0.08944544 0.023255814 0.06976744    0

Summarize the Likert responses with the likert::likert() function. This is the one place where you will need the items to be treated as factors.

brand_rep %>%
  data.frame() %>% # read_csv() returns a tibble
  mutate(across(everything(), as.factor)) %>%  # likert() uses factors
  likert::likert() %>%
  plot() + 
  labs(title = "Brand Reputation Survey") +
  theme(legend.position = "top")

Missing values may mean respondents did not understand the question or did not want to reveal their answer. If fewer than 5% of responses have missing values, you can simply drop those responses. If missing values are a bigger problem, try Hmisc::naclus() to see which items tend to be missing in the same record (sketched after the code below). This survey is clean.

nrow(brand_rep) - nrow(na.omit(brand_rep)) # num cases
## [1] 0
colSums(is.na(brand_rep)) # num cases by col
##      well_made     consistent poor_workman_r   higher_price       lot_more 
##              0              0              0              0              0 
##          go_up     stands_out         unique  one_of_a_kind 
##              0              0              0              0
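
If missing values had been a problem, a minimal sketch of the Hmisc::naclus() approach (not run here, since this survey is complete) might look like this:

# Sketch: cluster items by shared missingness (assumes the Hmisc package is installed).
brand_rep_na <- Hmisc::naclus(as.data.frame(brand_rep))
plot(brand_rep_na)          # dendrogram of items that tend to be missing together
Hmisc::naplot(brand_rep_na) # additional summaries of the missingness patterns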

Correlations

You will want to identify items that correlate highly with each other, but not highly outside their group. These patterns are the basis of mapping factors to the latent variables. Factors are the concrete survey items; latent variables are the abstract concepts they are intended to measure, like brand loyalty or customer satisfaction. The correlation plot below appears to have 3 groups, plus a stand-alone variable (one_of_a_kind).

#psych::corr.test(brand_rep)
corrplot::corrplot(cor(brand_rep), method = "circle")

3. Item Reduction

The third phase explores the mapping of the factors (aka “manifest variables”) to the latent variable’s “dimensions” and refines the survey to exclude factors that do not map to a dimension. A latent variable may have several dimensions. E.g., “brand loyalty” may consist of “brand identification”, “perceived value”, and “brand trust”. Exploratory factor analysis (EFA) identifies the dimensions in the data and whether any items do not reveal information about the latent variable. EFA establishes internal reliability: whether similar items produce similar scores.

Start with a parallel analysis and scree plot. This will suggest the number of factors in the data. Use this number as the input to an exploratory factor analysis.

Parallel Analysis

A scree plot is a line plot of the eigenvalues. An eigenvalue measures the amount of variance explained by each factor. Only factors with eigenvalues greater than those from uncorrelated data are useful. You want to find a sharp reduction in the size of the eigenvalues (like a cliff), with the rest of the smaller eigenvalues constituting rubble (scree!). After the eigenvalues drop dramatically in size, additional factors add relatively little to the information already extracted.

Parallel analysis helps to make the interpretation of scree plots more objective. The eigenvalues are plotted along with eigenvalues of simulated variables with population correlations of 0. The number of eigenvalues above the point where the two lines intersect is the suggested number of factors. The rationale for parallel analysis is that useful factors account for more variance than could be expected by chance.

psych::fa.parallel() compares the scree of your data set to that of a random data set to identify the number of factors. The elbow below is at 3 factors.

psych::fa.parallel(brand_rep)

## Parallel analysis suggests that the number of factors =  3  and the number of components =  2

Exploratory Factor Analysis

Use psych::fa() to perform the factor analysis with your chosen number of factors. The number of factors may be the result of your parallel analysis, or the opinion of the SMEs. In this case, we’ll go with the 3 factors identified by the parallel analysis.

brand_rep_efa <- psych::fa(brand_rep, nfactors = 3)
## Loading required namespace: GPArotation
# psych::scree(brand_rep) # psych makes scree plots too.
psych::fa.diagram(brand_rep_efa)

Using EFA, you may tweak the number of factors or drop poorly-loading items. Each item should load highly to one and only one dimension. This one dimension is the item’s primary loading. Generally, a primary loading > .7 is excellent, >.6 is very good, >.5 is good, >.4 is fair, and <.4 is poor. Here are the factor loadings from the 3 factor model.

brand_rep_efa$loadings
## 
## Loadings:
##                MR2    MR1    MR3   
## well_made       0.896              
## consistent      0.947              
## poor_workman_r  0.739              
## higher_price    0.127  0.632  0.149
## lot_more               0.850       
## go_up                  0.896       
## stands_out                    1.008
## unique                        0.896
## one_of_a_kind   0.309  0.295  0.115
## 
##                  MR2   MR1   MR3
## SS loadings    2.361 2.012 1.858
## Proportion Var 0.262 0.224 0.206
## Cumulative Var 0.262 0.486 0.692

The brand-rep survey items load well to 3 factors except for the one_of_a_kind item. Its primary factor loading (0.309) is poor. The others are in the very good (.6-.7) or excellent (>.7) range.

Look at the model eigenvalues. There should be one eigenvalue per dimension. Eigenvalues a little less than one may be contaminating the model.

brand_rep_efa$e.value
## [1] 4.35629549 1.75381015 0.98701607 0.67377072 0.39901205 0.35865598 0.23915591
## [8] 0.14238807 0.08989556

Look at the factor score correlations. They should all be around 0.6. Much smaller means they are not describing the same latent variable. Much larger means they are describing the same dimension of the latent variable.

brand_rep_efa$r.scores
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.3147485 0.4778759
## [2,] 0.3147485 1.0000000 0.5428218
## [3,] 0.4778759 0.5428218 1.0000000

If you have a poorly loaded dimension, drop items one at a time from the scale. one_of_a_kind loads across all three factors, but does not load strongly onto any one of them. one_of_a_kind is not clearly measuring any dimension of the latent variable. Drop it and try again.

brand_rep_efa <- psych::fa(brand_rep %>% select(-one_of_a_kind), nfactors = 3)
brand_rep_efa$loadings
## 
## Loadings:
##                MR2    MR3    MR1   
## well_made       0.887              
## consistent      0.958              
## poor_workman_r  0.735              
## higher_price    0.120  0.596  0.170
## lot_more               0.845       
## go_up                  0.918       
## stands_out                    0.990
## unique                        0.916
## 
##                  MR2   MR3   MR1
## SS loadings    2.261 1.915 1.850
## Proportion Var 0.283 0.239 0.231
## Cumulative Var 0.283 0.522 0.753
brand_rep_efa$e.value
## [1] 4.00884429 1.75380537 0.97785883 0.42232527 0.36214534 0.24086772 0.14404264
## [8] 0.09011054
brand_rep_efa$r.scores
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.2978009 0.4802138
## [2,] 0.2978009 1.0000000 0.5351654
## [3,] 0.4802138 0.5351654 1.0000000

This is better. We have three dimensions of brand reputation:

  • items well_made, consistent, and poor_workman_r describe Product Quality,
  • items higher_price, lot_more, and go_up describe Willingness to Pay, and
  • items stands_out and unique describe Product Differentiation

Even if the data and your theory suggest otherwise, explore what happens when you include more or fewer factors in your EFA.

psych::fa(brand_rep, nfactors = 2)$loadings
## 
## Loadings:
##                MR1    MR2   
## well_made              0.884
## consistent             0.932
## poor_workman_r         0.728
## higher_price    0.745       
## lot_more        0.784 -0.131
## go_up           0.855 -0.103
## stands_out      0.591  0.286
## unique          0.540  0.307
## one_of_a_kind   0.378  0.305
## 
##                  MR1   MR2
## SS loadings    2.688 2.485
## Proportion Var 0.299 0.276
## Cumulative Var 0.299 0.575
psych::fa(brand_rep, nfactors = 4)$loadings
## 
## Loadings:
##                MR2    MR1    MR3    MR4   
## well_made       0.879                     
## consistent      0.964                     
## poor_workman_r  0.732                     
## higher_price           0.552  0.150  0.168
## lot_more               0.835              
## go_up                  0.929              
## stands_out                    0.976       
## unique                        0.932       
## one_of_a_kind                        0.999
## 
##                  MR2   MR1   MR3   MR4
## SS loadings    2.244 1.866 1.846 1.029
## Proportion Var 0.249 0.207 0.205 0.114
## Cumulative Var 0.249 0.457 0.662 0.776

The two-factor model worked okay. The four-factor model loaded only one variable onto the fourth factor. In this example the SME expected a three-factor model and the data did not contradict the theory, so stick with three.

Whereas the item generation phase tested for item equivalence, the EFA phase tests for internal reliability (consistency) of items. Internal reliability means the survey produces consistent results. The most common statistics for assessing internal reliability are Cronbach’s alpha and split-half.

Cronbach’s Alpha

In general, an alpha <.6 is unacceptable, <.65 is undesirable, <.7 is minimally acceptable, <.8 is respectable, <.9 is very good, and >=.9 suggests items are too alike. A very low alpha means items may not be measuring the same construct, so you should drop items. A very high alpha means items are multicollinear, and you should drop items. Here is Cronbach’s alpha for the brand reputation survey, after removing the poorly-loading one_of_a_kind variable.

psych::alpha(brand_rep[, 1:8])$total$std.alpha
## [1] 0.8557356

This value is in the “very good” range. Cronbach’s alpha is often used to measure the reliability of a single dimension. Here are the values for the 3 dimensions.

psych::alpha(brand_rep[, 1:3])$total$std # Product Quality
## [1] 0.8918025
psych::alpha(brand_rep[, 4:6])$total$std # Willingness to Pay
## [1] 0.8517566
psych::alpha(brand_rep[, 7:8])$total$std # Product Differentiation
## [1] 0.951472

Alpha is >0.7 for each dimension. Sometimes the alpha for our survey as a whole is greater than that of the dimensions. This can happen because Cronbach’s alpha is sensitive to the number of items. Over-inflation of the alpha statistic can be a concern when working with surveys containing a large number of items.
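
To see that sensitivity, here is a small standalone illustration (not survey data) using the standardized alpha formula \(\alpha = k\bar{r} / (1 + (k - 1)\bar{r})\): alpha climbs with the number of items \(k\) even when the average inter-item correlation \(\bar{r}\) is held fixed.

# Standardized Cronbach's alpha as a function of the number of items k,
# holding the average inter-item correlation fixed at 0.4.
k <- c(2, 4, 8, 16, 32)
r_bar <- 0.4
round((k * r_bar) / (1 + (k - 1) * r_bar), 2)
## [1] 0.57 0.73 0.84 0.91 0.96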

Split-Half

Use psych::splitHalf() to split the survey in half and test whether all parts of the survey contribute equally to measurement. This method is much less popular than Cronbach’s alpha.

psych::splitHalf(brand_rep[, 1:8])
## Split half reliabilities  
## Call: psych::splitHalf(r = brand_rep[, 1:8])
## 
## Maximum split half reliability (lambda 4) =  0.93
## Guttman lambda 6                          =  0.92
## Average split half reliability            =  0.86
## Guttman lambda 3 (alpha)                  =  0.86
## Guttman lambda 2                          =  0.87
## Minimum split half reliability  (beta)    =  0.66
## Average interitem r =  0.43  with median =  0.4

4. Confirmatory Factor Analysis

Whereas EFA is used to develop a theory of the number of factors needed to explain the relationships among the survey items, confirmatory factor analysis (CFA) is a formal hypothesis test of the EFA theory. CFA measures construct validity, that is, whether you are really measuring what you claim to measure.

These notes explain how to use CFA, but do not explain the theory. For that you need to learn about dimensionality reduction, and structural equation modeling.

Use the lavaan package (latent variable analysis package), passing in the model definition. Here is the model for the three dimensions in the brand reputation survey. Lavaan’s default estimator is maximum likelihood, which assumes normality. You can change it to MLR which uses robust standard errors to mitigate non-normality. The summary prints a ton of output. Concentrate on the lambda - the factor loadings.

brand_rep_mdl <- paste(
  "PrdQl =~ well_made + consistent + poor_workman_r",
  "WillPay =~ higher_price + lot_more + go_up",
  "PrdDff =~ stands_out + unique", 
  sep = "\n"
)
brand_rep_cfa <- lavaan::cfa(model = brand_rep_mdl, data = brand_rep[, 1:8], estimator = "MLR")
# lavaan::summary(brand_rep_cfa, fit.measures = TRUE, standardized = TRUE)
semPlot::semPaths(brand_rep_cfa, rotation = 4)
## Registered S3 method overwritten by 'kutils':
##   method       from  
##   print.likert likert
## Registered S3 methods overwritten by 'huge':
##   method    from   
##   plot.sim  BDgraph
##   print.sim BDgraph

lavaan::inspect(brand_rep_cfa, "std")$lambda
##                PrdQl WillPy PrdDff
## well_made      0.914  0.000  0.000
## consistent     0.932  0.000  0.000
## poor_workman_r 0.730  0.000  0.000
## higher_price   0.000  0.733  0.000
## lot_more       0.000  0.812  0.000
## go_up          0.000  0.902  0.000
## stands_out     0.000  0.000  0.976
## unique         0.000  0.000  0.930

The CFA hypothesis test is a chi-square test, so it is sensitive to normality assumptions and n-size. Other fit measures are reported too:

  • Comparative Fit Index (CFI): look for values >.9.
  • Tucker-Lewis Index (TLI): look for values >.9.
  • Root Mean Squared Error of Approximation (RMSEA): look for values <.05.

There are actually 71 fit measures to choose from! Focus on CFI and TLI.

lavaan::fitMeasures(brand_rep_cfa, fit.measures = c("cfi", "tli"))
##   cfi   tli 
## 0.980 0.967

This output indicates a good model because both measures are >.9. Check the standardized estimates for each item. The standardized factor loadings are the basis of establishing construct validity. While we call these measures ‘loadings,’ they are better described as correlations of each manifest item with the dimensions. The difference between a perfect correlation and the observed correlation is considered ‘error.’ This relationship between the so-called ‘true’ and ‘observed’ scores is the basis of classical test theory.
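
In classical test theory terms, each observed item score decomposes as \[X = T + E\] where \(X\) is the observed score, \(T\) is the true score, and \(E\) is measurement error; the squared standardized loading estimates the share of an item’s variance attributable to the latent dimension rather than to error.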

lavaan::standardizedSolution(brand_rep_cfa) %>%
  filter(op == "=~") %>%
  select(lhs, rhs, est.std, pvalue)
##       lhs            rhs est.std pvalue
## 1   PrdQl      well_made   0.914      0
## 2   PrdQl     consistent   0.932      0
## 3   PrdQl poor_workman_r   0.730      0
## 4 WillPay   higher_price   0.733      0
## 5 WillPay       lot_more   0.812      0
## 6 WillPay          go_up   0.902      0
## 7  PrdDff     stands_out   0.976      0
## 8  PrdDff         unique   0.930      0

If you have a survey that meets your assumptions, performs well under EFA, but fails under CFA, return to your survey and revisit your scale, examine the CFA modification indices, factor variances, etc.

5. Convergent/Discriminant Validity

Construct validity means the survey measures what it intends to measure. It is composed of convergent validity and discriminant validity. Convergent validity means factors address the same concept. Discriminant validity means factors address different aspects of the concept.

Test for construct validity after assessing CFA model strength (with CFI, TLI, and RMSEA); a poor-fitting model may have greater construct validity than a better-fitting model. Use the semTools::reliability() function. The average variance extracted (AVE) measures convergent validity (avevar) and should be > .5. The composite reliability (CR) measures discriminant validity (omega) and should be > .7.

semTools::reliability(brand_rep_cfa)
##            PrdQl   WillPay    PrdDff
## alpha  0.8926349 0.8521873 0.9514719
## omega  0.9017514 0.8605751 0.9520117
## omega2 0.9017514 0.8605751 0.9520117
## omega3 0.9019287 0.8646975 0.9520114
## avevar 0.7573092 0.6756174 0.9084654

These values look good for all three dimensions. As an aside, alpha is Cronbach’s alpha. Do not be tempted to test reliability and validity in the same step. Start with reliability because it is a necessary but insufficient condition for validity. By checking for internal consistency first, as measured by alpha, and then for construct validity, as measured by AVE and CR, you establish the reliability of the scale as a whole before taking it to the next level by checking construct validity among the individual dimensions.

At this point you have established that the latent and manifest variables are related as hypothesized, and that the survey measures what you intended to measure, in this case, brand reputation.

6. Replication

The replication phase establishes criterion validity and stability (reliability). Criterion validity is a measure of the relationship between the construct and some external measure of interest. Measure criterion validity with concurrent validity, how well items correlate with an external metric measured at the same time, and with predictive validity, how well an item predicts an external metric. Stability means the survey produces similar results over repeated test-retest administrations.

Criterion Validity

Concurrent Validity

Concurrent validity is a measure of whether our latent construct is significantly correlated to some outcome measured at the same time.

Suppose you have an additional data set of consumer spending on the brand. The consumer’s perception of the brand should correlate with their spending. Before checking for concurrent validity, standardize the data so that the Likert items and other variable types are on the same scale.

set.seed(20201004)
brand_rep <- brand_rep %>%
  mutate(spend = ((well_made + consistent + poor_workman_r)/3 * 5 +
                  (higher_price + lot_more + go_up)/3 * 3 +
                  (stands_out + unique)/2 * 2) / 10)
brand_rep$spend <- brand_rep$spend + rnorm(559, 5, 4) # add randomness
brand_rep_scaled <- scale(brand_rep)

Do respondents with higher scores on the brand reputation scale also tend to spend more at the store? Build the model, latentizing spend as Spndng and relating it to the other latent variables with the ~~ (covariance) operator. Fit the model with the lavaan::sem() function.

brand_rep_cv_mdl <- paste(
  "PrdQl =~ well_made + consistent + poor_workman_r",
  "WillPay =~ higher_price + lot_more + go_up",
  "PrdDff =~ stands_out + unique",
  "Spndng =~ spend",
  "Spndng ~~ PrdQl + WillPay + PrdDff",
  sep = "\n"
)
brand_rep_cv <- lavaan::sem(data = brand_rep_scaled, model = brand_rep_cv_mdl)

Here are the standardized covariances. Because the data is standardized, interpret these as correlations. The p-values are significant because spend was constructed partly from the survey items; the added noise only weakens the correlations.

lavaan::standardizedSolution(brand_rep_cv) %>% 
  filter(rhs == "Spndng") %>%
  select(-op, -rhs)
##       lhs est.std    se     z pvalue ci.lower ci.upper
## 1   PrdQl   0.174 0.043 4.092      0    0.091    0.258
## 2 WillPay   0.241 0.043 5.654      0    0.157    0.324
## 3  PrdDff   0.198 0.041 4.779      0    0.117    0.279
## 4  Spndng   1.000 0.000    NA     NA    1.000    1.000
semPlot::semPaths(brand_rep_cv, whatLabels = "est.std", edge.label.cex = .8, rotation = 2)

Each dimension of brand reputation is positively correlated to spending history and the relationships are all significant.

Predictive Validity

Predictive validity is established by regressing some future outcome on your established construct. Assess predictive validity just as you would with any linear regression: regression estimates and p-values (standardizedSolution()), and the r-squared coefficient of determination (inspect()).

Build a regression model with the single ~ operator. Then fit the model to the data as before.

brand_rep_pv_mdl <- paste(
  "PrdQl =~ well_made + consistent + poor_workman_r",
  "WillPay =~ higher_price + lot_more + go_up",
  "PrdDff =~ stands_out + unique",
  "spend ~ PrdQl + WillPay + PrdDff",
  sep = "\n"
)
brand_rep_pv <- lavaan::sem(data = brand_rep_scaled, model = brand_rep_pv_mdl)
#lavaan::summary(brand_rep_pv, standardized = T, fit.measures = T, rsquare = T)
semPlot::semPaths(brand_rep_pv, whatLabels = "est.std", edge.label.cex = .8, rotation = 2)

lavaan::standardizedSolution(brand_rep_pv) %>% 
  filter(op == "~") %>%
  mutate_if(is.numeric, round, digits = 3)
##     lhs op     rhs est.std    se     z pvalue ci.lower ci.upper
## 1 spend  ~   PrdQl   0.094 0.048 1.949  0.051   -0.001    0.189
## 2 spend  ~ WillPay   0.182 0.052 3.486  0.000    0.080    0.284
## 3 spend  ~  PrdDff   0.060 0.055 1.092  0.275   -0.047    0.167
lavaan::inspect(brand_rep_pv, "r2")
##      well_made     consistent poor_workman_r   higher_price       lot_more 
##          0.837          0.867          0.533          0.537          0.658 
##          go_up     stands_out         unique          spend 
##          0.816          0.951          0.866          0.072

There is a statistically significant relationship between one dimension of brand reputation (Willingness to Pay) and spending. At this point you may want to drop the other two dimensions. However, the R^2 is not good: only 7% of the variability in spending can be explained by the three dimensions of our construct.

Factor scores represent individual respondents’ standing on a latent factor. While not used for scale validation per se, factor scores can be used for customer segmentation via clustering, network analysis and other statistical techniques.

brand_rep_cfa <- lavaan::cfa(brand_rep_pv_mdl, data = brand_rep_scaled)

brand_rep_cfa_scores <- lavaan::predict(brand_rep_cfa) %>% as.data.frame()
psych::describe(brand_rep_cfa_scores)
##         vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis
## PrdQl      1 559    0 0.88  -0.50   -0.19 0.09 -0.57 4.02  4.59  2.31     6.18
## WillPay    2 559    0 0.69   0.00    0.01 0.86 -1.15 1.16  2.31 -0.05    -1.18
## PrdDff     3 559    0 0.96  -0.04   -0.15 1.17 -0.88 2.54  3.41  1.09     0.35
##           se
## PrdQl   0.04
## WillPay 0.03
## PrdDff  0.04
psych::multi.hist(brand_rep_cfa_scores)

map(brand_rep_cfa_scores, shapiro.test)
## $PrdQl
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.66986, p-value < 2.2e-16
## 
## 
## $WillPay
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.94986, p-value = 7.811e-13
## 
## 
## $PrdDff
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.82818, p-value < 2.2e-16

These scores are not normally distributed, which makes clustering a great choice for modeling factor scores. In this context, clustering does not mean distance-based clustering such as k-means; mixture models consider the data as coming from a distribution which is itself a mixture of clusters. To learn more about model-based clustering, see the Hierarchical and Mixed Effects Models DataCamp course.
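
As a rough sketch of what that might look like, you could fit a Gaussian mixture model to the factor scores, here using the mclust package (an assumed choice, not used elsewhere in these notes):

# Sketch: model-based clustering of the CFA factor scores with a Gaussian
# mixture model (assumes the mclust package is installed).
library(mclust)
brand_rep_mix <- Mclust(brand_rep_cfa_scores)
summary(brand_rep_mix)              # chosen model and number of clusters
table(brand_rep_mix$classification) # respondents per segment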

Factor scores can be extracted from a structural equation model and used as inputs in other models. For example, you can use the factor scores from the brand reputation dimensions as regressors in a regression on spending.

brand_rep_fs_reg_dat <- bind_cols(brand_rep_cfa_scores, spend = brand_rep$spend)
brand_rep_fs_reg <- lm(spend ~ PrdQl + WillPay + PrdDff, data = brand_rep_fs_reg_dat)
summary(brand_rep_fs_reg)$coef
##              Estimate Std. Error    t value      Pr(>|t|)
## (Intercept) 7.1555354  0.1591195 44.9695738 3.363705e-187
## PrdQl       0.4260875  0.2062002  2.0663774  3.925620e-02
## WillPay     1.1365087  0.2805960  4.0503388  5.842799e-05
## PrdDff      0.1714031  0.2181813  0.7855993  4.324375e-01

The coefficients and r-squared of the lm() and sem() models closely resemble each other, but keeping the regression inside the lavaan framework provides more information (as witnessed in the higher estimates and r-squared). A construct, once validated, can be combined with a wide range of outcomes and models to produce valuable information about consumer behavior and habits.

Test-Retest Reliability

Test-retest reliability is the ability to achieve the same result from a respondent at two closely-spaced points in time (repeated measures).

Suppose you had two surveys, identified by an id field.

svy_1 <- brand_rep[sample(1:559, 300),] %>% as.data.frame()
svy_2 <- brand_rep[sample(1:559, 300),] %>% as.data.frame()
survey_test_retest <- psych::testRetest(t1 = svy_1, t2 = svy_2, id = "id")
## boundary (singular) fit: see ?isSingular
survey_test_retest$r12
## [1] -0.08346322

An r^2 <.7 is unacceptable, <.9 good, and >.9 very good. This one is unacceptable.

One way to check for replication is by splitting the data in half.

svy <- bind_rows(svy_1, svy_2, .id = "time")

psych::describeBy(svy, "time")
## 
##  Descriptive statistics by group 
## group: 1
##                vars   n mean   sd median trimmed  mad   min   max range  skew
## time*             1 300 1.00 0.00      1    1.00 0.00  1.00  1.00  0.00   NaN
## well_made         2 300 1.50 0.82      1    1.33 0.00  1.00  5.00  4.00  2.26
## consistent        3 300 1.50 0.87      1    1.30 0.00  1.00  5.00  4.00  2.19
## poor_workman_r    4 300 1.37 0.80      1    1.18 0.00  1.00  5.00  4.00  2.68
## higher_price      5 300 2.63 1.44      2    2.53 1.48  1.00  5.00  4.00  0.41
## lot_more          6 300 3.43 1.47      4    3.54 1.48  1.00  5.00  4.00 -0.34
## go_up             7 300 3.21 1.43      3    3.26 1.48  1.00  5.00  4.00 -0.20
## stands_out        8 300 2.12 1.21      2    1.94 1.48  1.00  5.00  4.00  0.97
## unique            9 300 2.09 1.22      2    1.89 1.48  1.00  5.00  4.00  1.03
## one_of_a_kind    10 300 1.82 1.25      1    1.54 0.00  1.00  5.00  4.00  1.49
## spend            11 300 7.13 4.04      7    7.10 3.94 -5.59 18.92 24.51  0.02
##                kurtosis   se
## time*               NaN 0.00
## well_made          6.08 0.05
## consistent         5.11 0.05
## poor_workman_r     7.76 0.05
## higher_price      -1.17 0.08
## lot_more          -1.34 0.09
## go_up             -1.28 0.08
## stands_out         0.04 0.07
## unique             0.10 0.07
## one_of_a_kind      1.07 0.07
## spend             -0.09 0.23
## ------------------------------------------------------------ 
## group: 2
##                vars   n mean   sd median trimmed  mad   min   max range  skew
## time*             1 300 1.00 0.00   1.00    1.00 0.00  1.00  1.00  0.00   NaN
## well_made         2 300 1.52 0.89   1.00    1.32 0.00  1.00  5.00  4.00  2.21
## consistent        3 300 1.48 0.85   1.00    1.28 0.00  1.00  5.00  4.00  2.29
## poor_workman_r    4 300 1.39 0.81   1.00    1.20 0.00  1.00  5.00  4.00  2.60
## higher_price      5 300 2.46 1.36   2.00    2.33 1.48  1.00  5.00  4.00  0.58
## lot_more          6 300 3.34 1.50   3.00    3.42 2.97  1.00  5.00  4.00 -0.22
## go_up             7 300 3.05 1.45   3.00    3.07 1.48  1.00  5.00  4.00 -0.03
## stands_out        8 300 1.98 1.16   2.00    1.77 1.48  1.00  5.00  4.00  1.15
## unique            9 300 1.92 1.15   2.00    1.71 1.48  1.00  5.00  4.00  1.22
## one_of_a_kind    10 300 1.73 1.21   1.00    1.44 0.00  1.00  5.00  4.00  1.68
## spend            11 300 7.14 3.77   7.05    7.08 3.89 -1.96 16.99 18.94  0.11
##                kurtosis   se
## time*               NaN 0.00
## well_made          5.26 0.05
## consistent         5.70 0.05
## poor_workman_r     7.40 0.05
## higher_price      -0.85 0.08
## lot_more          -1.43 0.09
## go_up             -1.34 0.08
## stands_out         0.47 0.07
## unique             0.60 0.07
## one_of_a_kind      1.73 0.07
## spend             -0.37 0.22
brand_rep_test_retest <- psych::testRetest(
  t1 = filter(svy, time == 1),
  t2 = filter(svy, time == 2),
  id = "id")
## boundary (singular) fit: see ?isSingular
brand_rep_test_retest$r12
## [1] -0.08346322

If the correlation of scaled scores across time 1 and time 2 is greater than .9, that indicates very strong test-retest reliability. This measure can be difficult to collect because it requires the same respondents to answer the survey at two points in time. However, it’s a good technique to have in your survey development toolkit.

When validating a scale, it’s a good idea to split the survey results into two samples, using one for EFA and one for CFA. This works as a sort of cross-validation such that the overall fit of the model is less likely due to chance of any one sample’s makeup.

brand_rep_efa_data <- brand_rep[1:280,]
brand_rep_cfa_data <- brand_rep[281:559,]
 
efa <- psych::fa(brand_rep_efa_data, nfactors = 3)
efa$loadings
## 
## Loadings:
##                MR1    MR2    MR3   
## well_made       0.879              
## consistent      0.931              
## poor_workman_r  0.795              
## higher_price    0.139  0.640  0.137
## lot_more               0.868       
## go_up                  0.893       
## stands_out                    0.980
## unique                        0.942
## one_of_a_kind   0.360  0.279  0.132
## spend           0.107  0.223       
## 
##                  MR1   MR2   MR3
## SS loadings    2.433 2.091 1.896
## Proportion Var 0.243 0.209 0.190
## Cumulative Var 0.243 0.452 0.642
brand_rep_cfa <- lavaan::cfa(brand_rep_mdl, data = brand_rep_cfa_data)
lavaan::inspect(brand_rep_cfa, what = "call")
## [[1]]
## lavaan::lavaan
## 
## $model
## brand_rep_mdl
## 
## $data
## brand_rep_cfa_data
## 
## $model.type
## [1] "cfa"
## 
## $int.ov.free
## [1] TRUE
## 
## $int.lv.free
## [1] FALSE
## 
## $auto.fix.first
## [1] TRUE
## 
## $auto.fix.single
## [1] TRUE
## 
## $auto.var
## [1] TRUE
## 
## $auto.cov.lv.x
## [1] TRUE
## 
## $auto.cov.y
## [1] TRUE
## 
## $auto.th
## [1] TRUE
## 
## $auto.delta
## [1] TRUE
## 
## $auto.efa
## [1] TRUE
lavaan::fitmeasures(brand_rep_cfa)[c("cfi","tli","rmsea")]
##        cfi        tli      rmsea 
## 0.97503646 0.95888359 0.09006769