Setup

Packages

library(pacman)
p_load(dplyr, psych, lavaan)

Functions

BESDS <- function(r){
  PC = 0.5 * abs(r) + 0.5
  return(PC)}

OmegaU <- function(x){
  require(lavaan)
  IN <- inspect(x, 'std')$lambda #maybe ^2? I don't think so due to path tracing rules
  GL <- sum(IN); DT <- attributes(IN)$dimnames[[1]]
  FL <- x@call; DF <- eval(FL$data); DD <- DF[DT]
  CM <- ((sum(cor(DD))) - length(DD))/2
  RE <- GL/CM
  return(RE)}

The latter function is to compute \(\omega\) from a single factor lavaan model.

Data

N79 <- read.csv("NLS79SES.csv")
N97 <- read.csv("NLS97SES.csv")

Rationale

Gregory Clark takes a latent variable approach to social status in order to account for measurement error. There are a few assumptions of the model, like invariance between generations, that have, to my knowledge, not been looked at. Another assumption that is critical for even using a latent variable in the first place is local independence. If manifest variables used to indicate some sort of latent social status are residually correlated, then there may be problems using a latent variable. This may be what's expected for a few reasons. Namely, that these variables are often causally related. For instance, a prestigious occupation can cause high income, or if we use modern measurements, low income can cause welfare use, bankruptcy, or giving up the pursuit of education, which, itself, may cause higher incomes (in a sense). So perhaps a latent variable model will be inappropriate; the do-calculus is certainly uncertain on this question. If it turns out a latent variable model is not appropriate because, say, the measures are not unidimensional enough, it may be better to use PCA for a summary, but this loses the measurement error reducing benefit, even if it leads to variance reduction. If that is the case, other means of correcting for measurement error, both in variables and in sumscores, should be considered.

I will test if some social status measures yield a (1) fitting latent variable model with (2) adequate local independence in the NLSY 1979 and 1997 cohorts. For the NLSY '79, I will be using (and am open to suggestions)

Total welfare received (total 2010:2016)
Wages/salary/tips (total 2010:2016)
Weeks of unemployment (total 2010:2016)
Poverty status (total 2010:2016)
Highest degree received (as of latest sampling, i.e., 2016)
Wealth (total 2012, 2016)
Market value of residential properties (total 2012, 2016)
Bankruptcy declarations (as of latest sampling)
Number of businesses ever owned (as of latest sampling)

Because the variables changed, for the NLSY '97, I will be using

Poverty ratio (total 2011:2017)
Total income (total 2011:2017)
Unemployed months (total 2012:2018)
Number of businesses reported (as of latest sampling)
Business or partnership market values (as of latest sampling)
Highest degree received (latest sampling)
Net worth (age 35)
Market value of residence (as of latest sampling)
Bankruptcy declarations (as of latest sampling)

I do not know why the '97 stopped collecting data on welfare receipt in 2009. I doubt it was because it became too common or something like that.

Analysis

National Longitudinal Survey of Youth - 1979

N79 <- N79 %>% rename(
  W10   = G0260300, #total welfare received, last year
  W12   = G0276000,
  W14   = G0292700,
  W16   = G0313100,
  WST10 = T3045300, #wages/salary/tips, last year
  WST12 = T3977400,
  WST14 = T4915800,
  WST16 = T5619500,
  UNE10 = T3109000, #unemployment weeks, last year
  UNE12 = T4113500,
  UNE14 = T5024900,
  UNE16 = T5772900,
  POV10 = T3108000, #poverty status, last year
  POV12 = T4112500,
  POV14 = T5022800,
  POV16 = T5770900,
  EDU   = T5177500, #highest degree received, latest year
  WEA12 = T4045802, #net wealth, last year
  WEA16 = T5684500,
  RES12 = T4049500, #market value of residential properties
  RES16 = T5687800,
  BANKR = T5732100, #number of bankruptcies, latest year
  ENTRE = X0045900, #number of businesses ever owned
  AFQT  = R0618200  #AFQT percentile score
  ) %>% mutate(
  WT  = W10 + W12 + W14 + W16,
  INC = WST10 + WST12 + WST14 + WST16,
  UNE = UNE10 + UNE12 + UNE14 + UNE16,
  POV = POV10 + POV12 + POV14 + POV16,
  WEA = WEA12 + WEA16,
  RES = RES12 + RES16)

describe(N79)

SESM79 <- '
SES =~ WT + INC + UNE + POV + EDU + WEA + RES + BANKR + ENTRE'

FIT79 <- sem(SESM79, data = N79, std.lv = T, std.ov = T)
fitMeasures(FIT79)

##                npar                fmin               chisq                  df 
##              18.000               0.202            5137.367              27.000 
##              pvalue      baseline.chisq         baseline.df     baseline.pvalue 
##               0.000           51327.367              36.000               0.000 
##                 cfi                 tli                nnfi                 rfi 
##               0.900               0.867               0.867               0.867 
##                 nfi                pnfi                 ifi                 rni 
##               0.900               0.675               0.900               0.900 
##                logl   unrestricted.logl                 aic                 bic 
##         -138906.388         -136337.704          277848.776          277982.844 
##              ntotal                bic2               rmsea      rmsea.ci.lower 
##           12686.000          277925.642               0.122               0.119 
##      rmsea.ci.upper        rmsea.pvalue                 rmr          rmr_nomean 
##               0.125               0.000               0.078               0.078 
##                srmr        srmr_bentler srmr_bentler_nomean                crmr 
##               0.078               0.078               0.078               0.087 
##         crmr_nomean          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.087               0.078               0.078             100.054 
##               cn_01                 gfi                agfi                pgfi 
##             116.968               0.916               0.860               0.549 
##                 mfi                ecvi 
##               0.818               0.408

resid(FIT79, "cor")

## $type
## [1] "cor.bollen"
## 
## $cov
##       WT     INC    UNE    POV    EDU    WEA    RES    BANKR  ENTRE 
## WT     0.000                                                        
## INC   -0.197  0.000                                                 
## UNE    0.040 -0.138  0.000                                          
## POV    0.026  0.011  0.017  0.000                                   
## EDU   -0.019 -0.007 -0.011 -0.012  0.000                            
## WEA   -0.119  0.399 -0.101  0.001 -0.002  0.000                     
## RES   -0.014 -0.004 -0.007 -0.001  0.012 -0.005  0.000              
## BANKR -0.010 -0.055  0.008 -0.005  0.037 -0.110  0.005  0.000       
## ENTRE -0.033  0.067 -0.022 -0.002  0.002  0.079 -0.003  0.025  0.000

mean(abs(resid(FIT79, "cor")$cov))

## [1] 0.03962961

OmegaU(FIT79)

## [1] 0.4778986

Vars79 <- c("WT", "INC", "UNE", "POV", "EDU", "WEA", "RES", "BANKR", "ENTRE", "AFQT"); Vars79 <- N79[Vars79]
FC <- cor(Vars79); FC; FCA <- FC[,'AFQT']; #FCA[1:length(FCA)-1]

##                WT         INC        UNE         POV         EDU         WEA
## WT     1.00000000 -0.09747381  0.1722029  0.23860283  0.13920528 -0.05644752
## INC   -0.09747381  1.00000000  0.1295101  0.44078113  0.31229458  0.52473548
## UNE    0.17220294  0.12951011  1.0000000  0.58988039  0.41486146  0.06735660
## POV    0.23860283  0.44078113  0.5898804  1.00000000  0.67131737  0.27110915
## EDU    0.13920528  0.31229458  0.4148615  0.67131737  1.00000000  0.19844436
## WEA   -0.05644752  0.52473548  0.0673566  0.27110915  0.19844436  1.00000000
## RES    0.19791321  0.42578495  0.5651300  0.91814629  0.69494552  0.26437182
## BANKR  0.10595509  0.17987660  0.3213307  0.49751425  0.41053100  0.03691875
## ENTRE  0.01675670  0.16728454  0.1116439  0.21296126  0.16120409  0.14241164
## AFQT  -0.14592429  0.25752004 -0.1098700 -0.04457269 -0.02041383  0.23641176
##               RES       BANKR      ENTRE        AFQT
## WT     0.19791321  0.10595509 0.01675670 -0.14592429
## INC    0.42578495  0.17987660 0.16728454  0.25752004
## UNE    0.56512995  0.32133073 0.11164389 -0.10987001
## POV    0.91814629  0.49751425 0.21296126 -0.04457269
## EDU    0.69494552  0.41053100 0.16120409 -0.02041383
## WEA    0.26437182  0.03691875 0.14241164  0.23641176
## RES    1.00000000  0.50618267 0.21147473 -0.04137285
## BANKR  0.50618267  1.00000000 0.14210627 -0.06036245
## ENTRE  0.21147473  0.14210627 1.00000000  0.07561672
## AFQT  -0.04137285 -0.06036245 0.07561672  1.00000000

MCA <- mean(abs(FCA[1:length(FCA)-1])); MCA; BESDS(MCA)

## [1] 0.1102294

## [1] 0.5551147

\(\omega\) seems low and the average residual correlation is very small, but the average masks potentially important variation within the matrix. For example, welfare usage is residually negatively correlated with income and wealth in addition to a positive correlation with unemployment and poverty. Income and unemployment are residually negatively correlated, and income is residually correlated with greater education and wealth, as well as a lower number of bankruptcies and a greater number of previously owned businesses. Unemployment is residually negatively related to wealth, and wealth is residually negatively related to bankruptcy and positively related to number of businesses previously owned. What "latent socioeconomic status" explains in these variables amounts to only 1 - (0.399/0.525) = 24% of the correlation between income and wealth. Using a latent socioeconomic status variable model or variable may end up being problematic. To clarify something, the reason these small deviations (compared to, say, cognitive data, where correlation matrices are much stronger) may be concerning is because the latent variable doesn't explain much anyway:

mean(inspect(FIT79, 'std')$lambda^2)

## [1] 0.3728846

Notably, we would not expect much difference with PCA (components are sumscores, not latent variables):

IN <- inspect(FIT79, 'std')$lambda; DT <- attributes(IN)$dimnames[[1]]; FL <- FIT79@call; DF <- eval(FL$data); DD <- DF[DT]
fa(DD, method = "pca")

## Factor Analysis using method =  minres
## Call: fa(r = DD, method = "pca")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        MR1    h2    u2 com
## WT    0.19 0.035 0.965   1
## INC   0.45 0.200 0.800   1
## UNE   0.57 0.322 0.678   1
## POV   0.97 0.936 0.064   1
## EDU   0.72 0.514 0.486   1
## WEA   0.30 0.090 0.910   1
## RES   0.96 0.924 0.076   1
## BANKR 0.52 0.267 0.733   1
## ENTRE 0.24 0.057 0.943   1
## 
##                 MR1
## SS loadings    3.34
## Proportion Var 0.37
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 factor is sufficient.
## 
## The degrees of freedom for the null model are  36  and the objective function was  4.05 with Chi Square of  51307.81
## The degrees of freedom for the model are 27  and the objective function was  0.42 
## 
## The root mean square of the residuals (RMSR) is  0.09 
## The df corrected root mean square of the residuals is  0.1 
## 
## The harmonic number of observations is  12686 with the empirical chi square  6762  with prob <  0 
## The total number of observations was  12686  with Likelihood Chi Square =  5351.73  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.862
## RMSEA index =  0.125  and the 90 % confidence intervals are  0.122 0.128
## BIC =  5096.62
## Fit based upon off diagonal values = 0.94
## Measures of factor score adequacy             
##                                                    MR1
## Correlation of (regression) scores with factors   0.99
## Multiple R square of scores with factors          0.97
## Minimum correlation of possible factor scores     0.95

At least in this dataset and with these indicators, there is not likely to be a general factor of socioeconomic status. Perhap using only (1) income; (2) education (3) occupational score, as Clark practically does, would change things. I may attempt that later. Using more indicators should not make for a worse model, excepting the penalty on fit that accrues normally for single-factor models and may necessitate, perhaps, moving to a higher-order (or some other variety of) model.

It is interesting and worth noting that attained wealth correlates more with the AFQT score than the education measure, which may not be highest degree attained (it's listed as highest grade; I will check this later; this is preliminary). It does not correlate more than income though. Attainment correlations contrast with parental SES-child IQ correlations, which are low for wealth and comparable quite high for parental education. The differences between attained and parental SES measure correlations are interesting for cross-racial comparisons, because the latter tend to be lower for lower-performing groups due to regression to the mean (Jensen discussed this in The g Factor), although, validating that perspective, the former are usually almost the same. Because, earlier, I found something like a 0.5 correlation with a measure someone else sent me, I think I'm probably using the wrong one for education. If that's the case, this looks like an almost Sowellian pattern of correlations, where you're much less likely to be poor if you just graduate high school.

National Longitudinal Survey of Youth - 1997

N97 <- N97 %>% rename(
  POV11  = T6656800, #poverty ratio - higher = less poverty
  POV13  = T8129200,
  POV15  = U0009000,
  POV17  = U1845600,
  INC11  = T7545600, #total wages/income/salary
  INC13  = T8976700,
  INC15  = U0956900,
  INC17  = U2857200,
  RES    = U2865500, #residential property value
  BUS    = Z0515400, #businesses reported
  BPV    = Z0515800, #market value of businesses/partnerships
  BANKR  = Z0525200, #number of bankruptcies
  UNE12  = Z9076003, #unemployment months
  UNE13  = Z9076004,
  UNE14  = Z9076005,
  UNE15  = Z9076006,
  UNE16  = Z9076007,
  UNE17  = Z9076008,
  UNE18  = Z9076009,
  EDU    = Z9083900, #highest degree attained
  NETW   = Z9141400, #net worth
  ASVAB  = R9829600  #ASVAB math-verbal percentile score
  ) %>% mutate(
  POV = POV11 + POV13 + POV15 + POV17,
  INC  = INC11 + INC13 + INC15 + INC17,
  UNE = UNE12 + UNE13 + UNE14 + UNE15 + UNE16 + UNE17 + UNE18)

describe(N97)

SESM97 <- '
SES =~ POV + INC + UNE + RES + BUS + BPV + BANKR + EDU + NETW'

FIT97 <- sem(SESM97, data = N97, std.lv = T, std.ov = T)
fitMeasures(FIT97)

##                npar                fmin               chisq                  df 
##              18.000               0.087            1563.725              27.000 
##              pvalue      baseline.chisq         baseline.df     baseline.pvalue 
##               0.000           16395.889              36.000               0.000 
##                 cfi                 tli                nnfi                 rfi 
##               0.906               0.875               0.875               0.873 
##                 nfi                pnfi                 ifi                 rni 
##               0.905               0.678               0.906               0.906 
##                logl   unrestricted.logl                 aic                 bic 
##         -107309.112         -106527.249          214654.223          214782.081 
##              ntotal                bic2               rmsea      rmsea.ci.lower 
##            8984.000          214724.880               0.080               0.076 
##      rmsea.ci.upper        rmsea.pvalue                 rmr          rmr_nomean 
##               0.083               0.000               0.045               0.045 
##                srmr        srmr_bentler srmr_bentler_nomean                crmr 
##               0.045               0.045               0.045               0.050 
##         crmr_nomean          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.050               0.045               0.045             231.461 
##               cn_01                 gfi                agfi                pgfi 
##             270.814               0.966               0.943               0.580 
##                 mfi                ecvi 
##               0.918               0.178

resid(FIT97, "cor")

## $type
## [1] "cor.bollen"
## 
## $cov
##       POV    INC    UNE    RES    BUS    BPV    BANKR  EDU    NETW  
## POV    0.000                                                        
## INC    0.010  0.000                                                 
## UNE   -0.001 -0.030  0.000                                          
## RES   -0.037 -0.028  0.205  0.000                                   
## BUS    0.017 -0.036  0.012  0.022  0.000                            
## BPV   -0.004 -0.001 -0.007  0.000  0.105  0.000                     
## BANKR -0.005 -0.007  0.053  0.035  0.023 -0.004  0.000              
## EDU    0.006  0.012 -0.067  0.027 -0.007 -0.013 -0.014  0.000       
## NETW  -0.009  0.001 -0.054  0.115  0.028  0.017 -0.018 -0.060  0.000

mean(abs(resid(FIT97, "cor")$cov))

## [1] 0.02689205

OmegaU(FIT97)

## [1] 0.5928947

\(\omega\) is higher but, based on some previous simulations, probably not high enough to justify a general factor of socioeconomic status in this dataset, as in the first one, but there is evidence for more, meaningful local independence among these measures. Deviations are mostly obvious, like a connection between residence value and net worth and between the number of businesses and the value of one's businesses and partnerships. These sorts of residual relationships hardly threaten the validity of a latent SES factor, but the lack of general dimensionality does. A better strategy is some form of dimension reduction or accounting for reliability by using a sumscore made from many measures (composite reliabilities are trivial to compute).

Vars97 <- c("POV", "INC", "UNE", "RES", "BUS", "BPV", "BANKR", "EDU", "NETW", "ASVAB"); Vars97 <- N97[Vars97]
FC <- cor(Vars97); FC; FCA <- FC[,'ASVAB']; #FCA[1:length(FCA)-1]

##              POV        INC         UNE        RES        BUS          BPV
## POV   1.00000000 0.72778674 0.362191428 0.43142582 0.15027002  0.027041565
## INC   0.72778674 1.00000000 0.319943226 0.42353802 0.09344531  0.028270877
## UNE   0.36219143 0.31994323 1.000000000 0.43312291 0.07738759  0.007664619
## RES   0.43142582 0.42353802 0.433122912 1.00000000 0.10651576  0.019113095
## BUS   0.15027002 0.09344531 0.077387593 0.10651576 1.00000000  0.110074367
## BPV   0.02704156 0.02827088 0.007664619 0.01911309 0.11007437  1.000000000
## BANKR 0.01544829 0.01236054 0.063230431 0.04799598 0.02624845 -0.003545517
## EDU   0.43304105 0.42331110 0.141316102 0.29541903 0.06932037  0.004493055
## NETW  0.40867473 0.40429284 0.150069400 0.37773597 0.10264323  0.034471934
## ASVAB 0.33908412 0.32236639 0.040989239 0.20230461 0.08541391  0.019569815
##              BANKR          EDU         NETW       ASVAB
## POV    0.015448291  0.433041052  0.408674729  0.33908412
## INC    0.012360541  0.423311097  0.404292838  0.32236639
## UNE    0.063230431  0.141316102  0.150069400  0.04098924
## RES    0.047995975  0.295419031  0.377735969  0.20230461
## BUS    0.026248447  0.069320365  0.102643225  0.08541391
## BPV   -0.003545517  0.004493055  0.034471934  0.01956981
## BANKR  1.000000000 -0.002409366 -0.006535394 -0.01149917
## EDU   -0.002409366  1.000000000  0.179889439  0.47341150
## NETW  -0.006535394  0.179889439  1.000000000  0.16839567
## ASVAB -0.011499169  0.473411502  0.168395669  1.00000000

MCA <- mean(abs(FCA[1:length(FCA)-1])); MCA; BESDS(MCA)

## [1] 0.1847816

## [1] 0.5923908

Again, the education measure in the first dataset may have been wrong. Even if that changed things, it's hardly plausible that the failure of this model would be overturned. Regarding the total amount this second model explains:

mean(inspect(FIT97, 'std')$lambda^2)

## [1] 0.2680762

And the result is still not very different from PCA:

IN <- inspect(FIT97, 'std')$lambda; DT <- attributes(IN)$dimnames[[1]]; FL <- FIT97@call; DF <- eval(FL$data); DD <- DF[DT]
fa(DD, method = "pca")

## Factor Analysis using method =  minres
## Call: fa(r = DD, method = "pca")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        MR1     h2   u2 com
## POV   0.84 0.7107 0.29   1
## INC   0.81 0.6518 0.35   1
## UNE   0.44 0.1944 0.81   1
## RES   0.60 0.3657 0.63   1
## BUS   0.16 0.0265 0.97   1
## BPV   0.04 0.0016 1.00   1
## BANKR 0.03 0.0011 1.00   1
## EDU   0.48 0.2282 0.77   1
## NETW  0.49 0.2406 0.76   1
## 
##                 MR1
## SS loadings    2.42
## Proportion Var 0.27
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 factor is sufficient.
## 
## The degrees of freedom for the null model are  36  and the objective function was  1.83 with Chi Square of  16387.07
## The degrees of freedom for the model are 27  and the objective function was  0.19 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  8984 with the empirical chi square  1461.22  with prob <  1.9e-291 
## The total number of observations was  8984  with Likelihood Chi Square =  1687.96  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.865
## RMSEA index =  0.083  and the 90 % confidence intervals are  0.079 0.086
## BIC =  1442.17
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    MR1
## Correlation of (regression) scores with factors   0.92
## Multiple R square of scores with factors          0.85
## Minimum correlation of possible factor scores     0.70

Discussion

I don't know what the reasons are to believe in a latent socioeconomic status variable. If it is real, it seems to not be very general. Because higher-order loadings are consistent between good single-factor, STCU, higher-order, and bifactor models, mathematically, it seems unlikely that much will change this. The use of latent variables seems dubious. The use of measures weighted based on the loadings from them seems even worse: why make a composite from something with a weight of 0.03 among, generously, a dozen variables? Latent variable modeling may generate results consistent with low mobility because it is a bad (and indeed, perhaps inappropriate) means of summarizing measures. Even the first principal component does not seem usable as a summary of latent socioeconomic status, trait, summary, or otherwise. Unless someone really wants to argue that what socioeconomic status measures have in common is all (or a reasonable approximation of) what's reliable for them (for all I know, this could be true), a probably far better strategy would be to use

\[X_t = \frac{\sum_{i=1}^kX_i-K}{\sqrt{\sum_{i=1}^k \sum_{j=1}^k (r_{ij})}}\]

with reliability equal to

\[\frac{\sum_{i=1}^k \sum_{j=1}^k r_{ij} - K + \sum_{i=1}^k r_i}{\sum_{i=1}^k \sum_{j=1}^k r_{ij}}\]

If anyone recalls the sources for these, do tell.

Maybe this is alright and, say, the amount of reliable variance in socioeconomic status is just what the measures (how are these selected?) have in common, but I've never seen anyone justify this and I feel it should be broken down into multiple subdomains instead of one broad domain in any case. The theoretical implications of latent variable models versus other means of dimensionality reduction should be reckoned with. If the result is fine, using a sumscore should generate comparable results. This does not justify the result, it's just a necessary corollary. I would be surprised if the total variance to be explained was like the amount observed in psychology.

Local Independence for Latent Socioeconomic Status

Jay Lasker

September 6, 2020